Skip to content
This repository has been archived by the owner on Jan 30, 2024. It is now read-only.

Check if it's practical to simplify the query structure with links using links2 #1144

Open
22 tasks
Ereski opened this issue Dec 20, 2021 · 1 comment
Open
22 tasks
Assignees
Labels
enhancement New feature or request

Comments

@Ereski
Copy link
Contributor

Ereski commented Dec 20, 2021

The current query structure has multiple stages:

  • The innermost query fetches all paths in the graph that's being queried. Only the contract IDs are fetched in this step. Some limited OFFSET/LIMIT handling is also done here if required.
  • These paths are split and deduplicated into individual edges.
  • The edges are put inside a MATERIALIZED CTE to prevent PG from using some really bad query plans. Unfortunately there's no other way to create an optimization barrier that I know of.
  • The actual contract data is fetched and the link tree is reconstructed based on the edges stored previously. This step may fetch the same contract multiple times as it builds a tree and not a graph. ORDER BY, OFFSET, and LIMIT are handled here too.

This is the structure I arrived at after a lot of testing with large test DBs. "Obvious" methods like subqueries or lateral joins did significantly worse in the worst case, even if they might have been faster in some easier cases. Unfortunately, this structure also requires complex queries that can be hard to understand and debug. It is designed to "handle" postgres and keep it in the happy path.

Since moving to the links2 table and seeing how it performs compared to the previous table I've wondered if this complexity is still necessary nowadays. This is going to be a tracking issue for benchmarks and redesigns if alternatives shown to be desirable.

  • Benchmark and analyze query plans in a database with 10 million contracts for the following basic scenarios:
    • Query a must exist link that exists:
      • Current method
      • Current method without MATERIALIZE
    • Query a must exist link that does not exist:
      • Current method
      • Current method without MATERIALIZE
    • Query a may exist link that exists:
      • Current method
      • Current method without MATERIALIZE
    • Query a may exist link that does not exist:
      • Current method
      • Current method without MATERIALIZE
    • Query a must exist link pair where both exists:
      • Current method
      • Current method without MATERIALIZE
    • Query a must exist link pair where one exists:
      • Current method
      • Current method without MATERIALIZE
    • Query a must exist link pair where neither exists:
      • Current method
      • Current method without MATERIALIZE

For each scenario, test the following variants:

  • Trivial match: all contracts have their ID specified. The result set is one with one link (if applicable). This is the easiest case and any method that doesn't do well here must be burned with fire.
  • Sequential: only the root contract has its ID specified. The result set is one with 10k links (if applicable). This case tests for the fairly common case where we query for a specific card and all its links.
  • Reverse: only the links have their ID specified. The result set is 10k contracts, each with the same single link (if applicable). This case tests for the not-so-common case where we query for all cards that have a specific link. This case can also be particularly hard for methods that hardcode the order of operations, like subqueries.
  • Aggregate match: no IDs are specified. The result set is 10k contracts, each with 10k links. This case is hard and tests for underspecified queries. Only applicable for the scenarios where the links exist.
  • Aggregate unmatch: no IDs are specified. The result set is empty. This is the hardest case and will cause plenty of churn but it is useful to test how well the query planner works with the query structure in the worst case (I've seen things....). Only applicable for the scenarios where the links don't exist.

For all queries, contracts have their types and one common and indexed data field specified. Both the root and the links have the same amount of random contracts but different types. The queries must return the the full contract as JSONB to prevent postgres from muddling the results with too much optimization.

While I don't think it's useful to test these right away, the method must be able to skip, limit and order both root cards and links.

For the tests I'll use postgres 13.4 with the default configuration and with the database stored in a ramdisk.

@Ereski Ereski self-assigned this Dec 20, 2021
@Ereski Ereski added the enhancement New feature or request label Dec 20, 2021
@Ereski
Copy link
Contributor Author

Ereski commented Dec 20, 2021

Not going to work on this right away as my queue is already full.

cc @LucianBuzzo @joshbwlng @Hades32 @Page- @thgreasi if you have any ideas or interest in following ^

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

1 participant