Typesense: TODO

  • Test for group + multiple fields
  • Intersect with single posting list
  • Test for erase dropping elements below compressed list threshold
  • Test for array token positions

Search index

  • Fix memory ratio (decreasing with indexing)
  • Speed up wildcard searches further
  • Allow int64 in default sorting field
  • Use connection timeout for CURL rather than request timeout
  • Async import
  • Highlight all matching fields
  • Proper JSON as input
  • Storing raw JSON input to RocksDB
  • ART for every indexed field
  • Delete should remove from RocksDB
  • Speed up UUID generation
  • Make the search score computation customizable
  • art int search should support signed ints
  • Search across multiple fields
  • Have set inside topster itself
  • Persist next_seq_id
  • collection_id should be int, not string
  • API should return count
  • Fix documents.jsonl path in tests
  • Multi field search tests
  • storage key prefix should include collection name
  • Index and search on multi-valued field
  • range search for art_int
  • Restore records as well on restart (like for meta)
  • drop collection should remove all records from the store
  • Multi-key binary search during scoring
  • Assumption that all tokens match for scoring is no longer true
  • Filters
  • Facets
  • Schema validation during insertion (missing fields + type errors)
  • Proper score field for ranking tokens
  • Throw errors when schema is broken
  • Desc/Asc ordering with tests
  • Found count is wrong
  • Filter query in the API
  • Facet limit (hardcode to top 10)
  • Deprecate old split function
  • Multiple facets not working
  • Search snippet with highlight
  • Snippet should only be around surrounding matching tokens
  • Proper pagination
  • Pagination parameter
  • Drop collection API
  • JSONP response
  • "error":"Not found." is sent when query has no hits
  • Fix API response codes
  • List all collections
  • Fetch an individual document
  • ID field should be a string: must validate
  • Number of records in collection
  • Test for asc/desc upper/lower casing
  • Test for search without any sort_by given
  • Test for collection creation validation
  • Test for delete document
  • art float search
  • When prefix=true, use default_sorting_field for token ordering only for last word
  • only last token should be prefix searched
  • Prefix-search strings should not be null terminated
  • sort results by float field
  • json::parse must be wrapped in try catch
  • Collection Manager collections map should store plain collection name
  • init_collection of Collection manager should probably take seq_id as param
  • node score should be int32, no longer uint16 like in document struct
  • Typo in prefix search
  • When the "id" field is present but is not a string, what happens?
  • test for num_documents
  • test for string filter comparison: title < "foo"
  • Test for sorted_array::indexOf when length is 0
  • Test for pagination
  • search_fields, sort_fields and facet fields should be combined
  • facet fields should be indexed verbatim
  • change "search_by" to "query_by"
  • during index_in_memory() validations should be front loaded
  • Support default sorting field being a float
  • https support
  • Validate before string to int conversion in the http api layer
  • art bool support
  • Export collection
  • get collection should show schema
  • API key should be allowed as a GET parameter also (for JSONP)
  • Don't crash when the data directory is not found
  • When the first sequence ID is not zero, bail out
  • Proper status code when sequence number to fetch is bad
  • Replica should be read-only
  • string_utils::tokenize should not have max length
  • handle hyphens (replace them)
  • clean special chars before indexing
  • Add docs/explanation around ranking calc
  • UTF-8 normalization
  • Use rocksdb batch put for atomic insertion
  • Proper logging
  • Handle store-get() not finding a key
  • Deprecate converting integer to string verbatim
  • Deprecate union type punning
  • Replica server should fail when pointed to "old" master
  • gzip compress responses
  • Have a LOG(ERROR) level
  • Handle SIGTERM which is sent when process is killed
  • Use snappy compression for storage
  • Fix exclude_scalar early returns
  • Fix result ids length during grouped overrides
  • Fix override grouping (collate_included_ids)
  • Test for overriding result on second page
  • At least 1 token must match before proceeding with drop tokens
  • support wildcard query with filters
  • API for optimizing on disk storage
  • Jemalloc
  • Exact search
  • NOT operator support
  • Log operations
  • Parameterize replica's MAX_UPDATES_TO_SEND
  • 64K token limit
  • INT32_MAX validation for float field

  • highlight of string arrays?
  • test for token ranking on float field
  • test for float int field deletion during doc deletion
  • Test for snippets
  • Test for replication
  • Query token ids should match query token ordering
  • ID should not have "/"
  • Group results by field
  • Delete using range: https://github.com/facebook/rocksdb/wiki/Delete-A-Range-Of-Keys
  • Test for string utils
  • Prevent string copy during indexing
  • Minimum results should be a variable instead of blindly going with max_results
  • Handle searching for non-existing fields gracefully
  • test for same match score but different primary, secondary attr
  • Support nested fields via "."
  • Support search operators like +, - etc.
  • Space sensitivity
  • Use bitmap index instead of compressed array for doc list?
  • Primary_rank_scores and secondary_rank_scores hashmaps should be combined?
  • d-ary heap?
  • topster: reject min heap value compare only when field is same
  • match index instead of match score

API

  • Support the following operations:
    • create a new index
    • index a single document
    • delete a document by ID
    • query an index
    • drop an index
    • fetch a document by ID

Clustering

  • Sync every incoming write with another Typesense server

Refactoring

  • token_count in leaf is redundant: can be accessed from value
  • storing length in offsets is redundant: it can be found by looking up the value at the next index in offset_index
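
The second refactoring item can be illustrated with a minimal sketch. It assumes a layout in which every document's token offsets are stored back to back in one array and `offset_index[i]` marks where document `i`'s run begins. The `OffsetStore` struct and `run_length` function are hypothetical names used only for illustration; they are not taken from the Typesense code base.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical layout (an assumption, not the actual Typesense structs):
// `offsets` holds the token positions of all documents back to back, and
// `offset_index[i]` is the position in `offsets` where document i's run starts.
struct OffsetStore {
    std::vector<uint32_t> offsets;
    std::vector<uint32_t> offset_index;
};

// Derives the length of document i's run without storing it explicitly:
// the run ends where the next run starts, or at the end of `offsets`
// when i is the last document.
inline std::size_t run_length(const OffsetStore& store, std::size_t i) {
    assert(i < store.offset_index.size());
    const std::size_t start = store.offset_index[i];
    const std::size_t end = (i + 1 < store.offset_index.size())
                                ? store.offset_index[i + 1]
                                : store.offsets.size();
    return end - start;
}
```

Under this assumed layout, dropping the stored length saves one integer per document per leaf, at the cost of one extra lookup and a bounds check for the last entry.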

Tech debt

  • Use GLOB file pattern for CMake (better IDE refactoring support)
  • DRY index_int64_field* methods