Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Graphql update #144

Closed
wants to merge 12 commits into from
Closed

Graphql update #144

wants to merge 12 commits into from

Conversation

evansiroky
Copy link
Contributor

@evansiroky evansiroky commented Oct 1, 2018

This PR is in response to issue #125. Its objective is to simplify and reduce the number of SQL queries generated to handle a single GraphQL query (i.e. this is an optimization which operates within a single GraphQL query; it does not cache or optimize across multiple GraphQL queries).

  • Update graphql-java from version 4.2 to 10.0
  • Refactor NestedJDBCFetcher to make queries with joins instead of numerous individual queries
  • Add dataloader integration to batch queries and cache sql queries
  • Refactor initialization of graphQL executions

@codecov-io
Copy link

codecov-io commented Oct 1, 2018

Codecov Report

Merging #144 into dev will increase coverage by 5.75%.
The diff coverage is 81.66%.

Impacted file tree graph

@@             Coverage Diff              @@
##                dev     #144      +/-   ##
============================================
+ Coverage     48.48%   54.24%   +5.75%     
- Complexity      596      758     +162     
============================================
  Files           142      143       +1     
  Lines          7039     7850     +811     
  Branches        817      974     +157     
============================================
+ Hits           3413     4258     +845     
+ Misses         3321     3232      -89     
- Partials        305      360      +55
Impacted Files Coverage Δ Complexity Δ
...veyal/gtfs/graphql/fetchers/ErrorCountFetcher.java 8% <0%> (-0.34%) 2 <0> (ø)
...om/conveyal/gtfs/graphql/fetchers/FeedFetcher.java 80% <100%> (+1.73%) 4 <0> (ø) ⬇️
...in/java/com/conveyal/gtfs/graphql/GraphQLUtil.java 96.55% <100%> (+0.25%) 8 <2> (+2) ⬆️
...a/com/conveyal/gtfs/graphql/GraphQLGtfsSchema.java 98.25% <100%> (ø) 1 <0> (ø) ⬇️
...onveyal/gtfs/graphql/fetchers/RowCountFetcher.java 67.77% <62.5%> (-0.41%) 11 <0> (ø)
...om/conveyal/gtfs/graphql/fetchers/JDBCFetcher.java 65.89% <79.69%> (+9.3%) 30 <22> (+5) ⬆️
...nveyal/gtfs/graphql/fetchers/SQLColumnFetcher.java 86.66% <83.33%> (-13.34%) 4 <2> (ø)
...veyal/gtfs/graphql/fetchers/NestedJDBCFetcher.java 79.06% <91.66%> (+9.3%) 6 <0> (-2) ⬇️
...in/java/com/conveyal/gtfs/graphql/GTFSGraphQL.java 88.88% <92.3%> (+22.22%) 4 <4> (+1) ⬆️
...in/java/com/conveyal/gtfs/loader/BatchTracker.java 75% <0%> (ø) 6% <0%> (?)
... and 19 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f98abe5...5a7c529. Read the comment docs.

Copy link
Contributor

@landonreed landonreed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general I think it's great that you have added DataLoader and cacheing to our GraphQL implementation. I do think the getResults method signature needs to be refined or refactored for a little more clarity if possible (in addition to a few more comments here and there).

sqlBuilder.append("select *");
return getResults(
getJdbcQueryDataLoaderFromContext(environment),
getDataSourceFromContext(environment),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since both the data loader and data source are coming from the environment variable, can't we just pass that to getResults and use these helper methods within that method?

@@ -75,54 +83,71 @@ public static GraphQLFieldDefinition field (String tableName) {
// FIXME: NestedJDBCFetcher may need to be refactored so that it avoids conventions of JDBCFetcher (like the
// implied limit of 50 records). For now, the autoLimit field has been added to JDBCFetcher, so that certain
// fetchers (like the nested ones used solely for joins here) will not apply the limit by default.
Map<String, Object> arguments = null;
Map<String, Object> graphQLQueryArguemnts = environment.getArguments();;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: graphQLQueryArguemnts


// Store each iteration's fetch results here.
List<Map<String, Object>> fetchResults = null;
JDBCFetcher lastFetcher = null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This lastFetcher needs a comment about what it does.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find it helpful to distinguish between "last" and "previous" in variable names. Is this the last (as in final, end of a list) one, or the one seen in the previous loop iteration?

return fetchResults;
// This piece of code will never be reached because of how things get returned above.
// But it's here to make java happy.
return new CompletableFuture<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems unnecessary. You could just move the return statement that happens in the last iteration to this block. You would need to instantiate the string builder at the beginning of this block. This final return could also benefit from a description of what it ultimately contains.

preparedStatementParameters,
whereConditions,
fromTables,
sqlStatementStringBuilder
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find it strange that this string builder argument is expected to contain the first clause of the SQL query (e.g., select * or select abc.table.*). This isn't documented anywhere and I feel like it will just cause confusion. Couldn't we pass something a little more transparent in here?

@landonreed landonreed assigned evansiroky and unassigned landonreed Oct 11, 2018
@evansiroky
Copy link
Contributor Author

All comments should be addressed and this PR should be ready to be merged.

@evansiroky evansiroky assigned landonreed and unassigned evansiroky Oct 15, 2018
@landonreed landonreed assigned abyrd and unassigned landonreed Oct 15, 2018
@landonreed
Copy link
Contributor

Thanks, @evansiroky. Now we just need @abyrd's review.

Copy link
Member

@abyrd abyrd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have to admit I'm kind of losing the plot on how all of this works, but I get the idea that it's cacheing responses for certain queries, then reusing the responses as needed, and somehow accumulating requests and executing them asynchronously. It would be reassuring to have some high-level summary of how this all works within the source code comments. Or maybe that already exists and I didn't see it?

Anyway it's great to see tests that demonstrate that this works as intended. I do get the impression that some things are being done asynchronously, which always raises questions about whether it will behave well with larger numbers of interleaved queries.

As a sanity check and just for explanatory purposes, it would be very interesting to have a side by side comparison of the series of SQL requests generated by a few deeply nested queries. Would it be possible to log those and include them in the PR or issue comments?

* @param dataSource The dataSource to use in connecting to a database to make sql queries.
* @param query The graphQL query
* @param variables The variables to apply to the graphQL query
* @return
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Map<String, Object> return type is very generic, so probably merits a little explanation. I guess this is returning the objects resulting from the GraphQL request. The Javadoc mentions creating the graphQL runner so my first impression was that it would return that runner. Probably worth clarifying that it creates the runner and actually submits it for execution (to an asynchronous mechanism, but blocking?) then returns the results once complete.

public class GTFSGraphQL {

private static DataSource dataSource;
/**
* Execute a graphQL request. This method creates a new graphQL runner that is able to cache and batch sql queries
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you clarify in the comment which particular part of the following code makes it capable of batching the requests?

registry.register("jdbcfetcher", jdbcQueryDataLoader);

return GraphQL.newGraphQL(GraphQLGtfsSchema.feedBasedSchema)
// this instrumentation implementation will dispatch all the dataloaders as each level fo the graphql query
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: 'fo' instead of 'of'. Is "dispatch all the dataloaders" common terminology for this library or for GraphQL? I don't really get what this means.

return GraphQL.newGraphQL(GraphQLGtfsSchema.feedBasedSchema)
// this instrumentation implementation will dispatch all the dataloaders as each level fo the graphql query
// is executed and hence make batched objects available to the query and the associated DataFetchers. This
// must be created new each time to avoid caching load results in future requests.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Javadoc for the method mentions that we want to cache results, then here it says that we don't want to "cache load results in future requests". It's not clear to me what this means. Do you mean that we want a fresh, empty cache for each request, i.e. that we don't want cached results to be reused by future (or concurrent) requests? Is this because we expect the fetched values to change between requests? Please clarify in the comments/javadoc.

} catch (SQLException ex) {
throw new RuntimeException(ex);
/**
* Helper method to return the DataSource from the DataFetchingEnvironment
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This Javadoc could be considered trivial, as it refers only to the return and parameter types. Maybe expand on how it's used rather than the types it handles - it seems that this is used to retrieve an object representing the SQL database, to allow access to it from within a GraphQL query.

* routes that serve a specific stop, starting with a top-level stop type, we can nest joins from stop ABC -> pattern
* stops -> patterns -> routes (see below example implementation for more details).
* This fetcher creates one big query by joining tables in the order specified in the initial definition. The data
* returned will only consist of the columns from the final table definition. All other tables are assumed to be
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is "the final table definition"? Its the word "definition" that's throwing me off... maybe it just means the last table in the chain of joins?

*/
public class NestedJDBCFetcher implements DataFetcher<List<Map<String, Object>>> {
public class NestedJDBCFetcher implements DataFetcher<Object> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again I'm hesitant to lose the type information. Using Object defeats the purpose of generic types. Are there cases where we're not returning lists of maps?


// Store each iteration's fetch results here.
List<Map<String, Object>> fetchResults = null;
JDBCFetcher lastFetcher = null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find it helpful to distinguish between "last" and "previous" in variable names. Is this the last (as in final, end of a list) one, or the one seen in the previous loop iteration?

@evansiroky evansiroky assigned evansiroky and unassigned abyrd Oct 17, 2018
@evansiroky evansiroky added the WIP Work in progress label Oct 17, 2018
@evansiroky evansiroky removed the WIP Work in progress label Oct 19, 2018
@evansiroky
Copy link
Contributor Author

Before:

screen shot 2018-10-19 at 12 02 52 am

After:

screen shot 2018-10-18 at 11 20 20 pm

@evansiroky evansiroky assigned abyrd and unassigned evansiroky Oct 19, 2018
@evansiroky
Copy link
Contributor Author

evansiroky commented Oct 19, 2018

I've added in the ability to combine certain sql queries which has sped up things considerably. Posted above are the before and after results of running the query test plan in jmeter. The purple line is for a request of the following graphql query run against the TriMet feed:

query nestedQuery($namespace: String) {
  feed(namespace: $namespace) {
    routes(limit: 1) {
      patterns(limit: -1) {
        route {
          patterns(limit: -1) {
            route {
              route_id
              stops(limit: -1) {
                stop_id
              }
            }
          }
          route_id
          stops(limit: -1) {
            stop_id
          }
        }
      }
      route_id
      stops(limit: -1) {
        stop_id
      }
    }
  }
}

@abyrd
Copy link
Member

abyrd commented Oct 25, 2018

Just discussed with @evansiroky and @landonreed. Edited @evansiroky's comment above to clarify that the timing results are from querying against the TriMet feed. He also clarified that most of the remaining response time after the optimization is from serializing the results. The optimization seems to have reduced the SQL query time significantly, and most remaining time is spent communicating the large amount of results back to the caller.

I still think it would be very interesting to include in the PR comments an example of a real-world query that could appear under the GTFS editor, with the SQL queries generated before and after the optimization (and perhaps timings).

@evansiroky
Copy link
Contributor Author

What is this all about anyways?

This PR does 3 things to speed up GraphQL queries: implementing java-dataloader as part of making the GraphQL queries, refactoring the NestedJDBCFetcher to make less queries and combining certain SQL queries.

Implementation of the Dataloader

The Dataloader is implemented to handle SQL queries. It does 2 beneficial things.

First, it caches the results of queries. Therefore, if a SQL query that has been ran before is ran again in the exact same form, the results of that SQL query will be returned from a cache instead of by making a query to Postgres.

Second, it batches queries together and presents them all at once which affords the opportunity to possibly combine the queries together.

NestedJDBCFetcher refactor

Previously, the NestedJDBCFetcher was making many intermediate queries when only 1 query was needed that used joins. For example, to fetch all stops belonging to a route, the NestedJDBCFetcher previously made extra intermediate queries to the patterns and then pattern_stops table and then finally the stops table. Here is an example of the progression:

select * from pree_vlfdxmuecxqgctfqnntwvc.patterns where route_id = '1'
select * from pree_vlfdxmuecxqgctfqnntwvc.pattern_stops where pattern_id = '1'
select * from pree_vlfdxmuecxqgctfqnntwvc.stops where stop_id in ('4u6g', 'johv')

The refactor of the NestedJDBCFetcher builds up a query with joins that accomplish the same result in one query:

select jmlh_jsohdsfkvjlgwulkbfnyne.stops.* 
from jmlh_jsohdsfkvjlgwulkbfnyne.patterns, 
  jmlh_jsohdsfkvjlgwulkbfnyne.pattern_stops, 
  jmlh_jsohdsfkvjlgwulkbfnyne.stops 
where jmlh_jsohdsfkvjlgwulkbfnyne.patterns.route_id = '1' and
jmlh_jsohdsfkvjlgwulkbfnyne.patterns.pattern_id = jmlh_jsohdsfkvjlgwulkbfnyne.pattern_stops.pattern_id and 
jmlh_jsohdsfkvjlgwulkbfnyne.pattern_stops.stop_id = jmlh_jsohdsfkvjlgwulkbfnyne.stops.stop_id

Combining certain SQL queries

Although lots of SQL statements are performed, one of the most common and combineable query is a basic select * from tablename where id = 'id'. Those kinds of queries are combined together such that these two queries:

select * from tablename where id = '1'
select * from tablename where id = '2'

become

select * from tablename where id in ('1', '2')

Obtuse example that highlights benefits of the above 3 improvements

This PR adds the canFetchMultiNestedEntitiesWithoutLimits test to the GTFSGraphQLTest file. This test involves selecting redundant data over and over. Shown below is the GraphQL query made in that test:

query ($namespace: String) {
  feed(namespace: $namespace) {
    feed_version
    routes(limit: -1) {
      route_id
      stops(limit: -1) {
        routes(limit: -1) {
          route_id
          stops(limit: -1) {
            routes(limit: -1) {
              route_id
              stops(limit: -1) {
                stop_id
              }
            }
            stop_id
          }
        }
        stop_id
      }
    }
  }
}

As can be seen, the stops of a route and the routes of the stops are fetched over and over. Shown below are the logs of this test before this PR was implemented:

3068 [Time-limited test] INFO graphql.schema.DataFetcher - SQL: select * from feeds where namespace = 'pree_vlfdxmuecxqgctfqnntwvc'
3078 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - SQL: select * from pree_vlfdxmuecxqgctfqnntwvc.routes
3079 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - Result size: 1
3089 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.NestedJDBCFetcher - Join values: [1]
3089 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - SQL: select * from pree_vlfdxmuecxqgctfqnntwvc.patterns where route_id = '1'
3091 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - Result size: 1
3091 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.NestedJDBCFetcher - Join values: [1]
3092 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - SQL: select * from pree_vlfdxmuecxqgctfqnntwvc.pattern_stops where pattern_id = '1'
3093 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - Result size: 2
3093 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.NestedJDBCFetcher - stops args: [limit]
3093 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.NestedJDBCFetcher - Join values: [4u6g, johv]
3094 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - SQL: select * from pree_vlfdxmuecxqgctfqnntwvc.stops where stop_id in ('4u6g', 'johv')
3096 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - Result size: 2
3096 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.NestedJDBCFetcher - Join values: [4u6g]
3096 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - SQL: select * from pree_vlfdxmuecxqgctfqnntwvc.pattern_stops where stop_id = '4u6g'
3097 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - Result size: 1
3097 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.NestedJDBCFetcher - Join values: [1]
3097 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - SQL: select * from pree_vlfdxmuecxqgctfqnntwvc.patterns where pattern_id = '1'
3098 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - Result size: 1
3098 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.NestedJDBCFetcher - routes args: [limit]
3098 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.NestedJDBCFetcher - Join values: [1]
3098 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - SQL: select * from pree_vlfdxmuecxqgctfqnntwvc.routes where route_id = '1'
3099 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - Result size: 1
3099 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.NestedJDBCFetcher - Join values: [1]
3099 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - SQL: select * from pree_vlfdxmuecxqgctfqnntwvc.patterns where route_id = '1'
3099 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - Result size: 1
3099 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.NestedJDBCFetcher - Join values: [1]
3100 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - SQL: select * from pree_vlfdxmuecxqgctfqnntwvc.pattern_stops where pattern_id = '1'
3100 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - Result size: 2
3100 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.NestedJDBCFetcher - stops args: [limit]
3100 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.NestedJDBCFetcher - Join values: [4u6g, johv]
3100 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - SQL: select * from pree_vlfdxmuecxqgctfqnntwvc.stops where stop_id in ('4u6g', 'johv')
3101 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - Result size: 2
3102 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.NestedJDBCFetcher - Join values: [4u6g]
3102 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - SQL: select * from pree_vlfdxmuecxqgctfqnntwvc.pattern_stops where stop_id = '4u6g'
3102 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - Result size: 1
3102 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.NestedJDBCFetcher - Join values: [1]
3102 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - SQL: select * from pree_vlfdxmuecxqgctfqnntwvc.patterns where pattern_id = '1'
3103 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - Result size: 1
3103 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.NestedJDBCFetcher - routes args: [limit]
3103 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.NestedJDBCFetcher - Join values: [1]
3103 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - SQL: select * from pree_vlfdxmuecxqgctfqnntwvc.routes where route_id = '1'
3103 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - Result size: 1
3104 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.NestedJDBCFetcher - Join values: [1]
3104 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - SQL: select * from pree_vlfdxmuecxqgctfqnntwvc.patterns where route_id = '1'
3104 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - Result size: 1
3104 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.NestedJDBCFetcher - Join values: [1]
3104 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - SQL: select * from pree_vlfdxmuecxqgctfqnntwvc.pattern_stops where pattern_id = '1'
3105 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - Result size: 2
3105 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.NestedJDBCFetcher - stops args: [limit]
3105 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.NestedJDBCFetcher - Join values: [4u6g, johv]
3105 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - SQL: select * from pree_vlfdxmuecxqgctfqnntwvc.stops where stop_id in ('4u6g', 'johv')
3106 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - Result size: 2
3108 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.NestedJDBCFetcher - Join values: [johv]
3109 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - SQL: select * from pree_vlfdxmuecxqgctfqnntwvc.pattern_stops where stop_id = 'johv'
3109 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - Result size: 1
3109 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.NestedJDBCFetcher - Join values: [1]
3109 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - SQL: select * from pree_vlfdxmuecxqgctfqnntwvc.patterns where pattern_id = '1'
3110 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - Result size: 1
3110 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.NestedJDBCFetcher - routes args: [limit]
3110 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.NestedJDBCFetcher - Join values: [1]
3110 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - SQL: select * from pree_vlfdxmuecxqgctfqnntwvc.routes where route_id = '1'
3110 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - Result size: 1
3111 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.NestedJDBCFetcher - Join values: [1]
3111 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - SQL: select * from pree_vlfdxmuecxqgctfqnntwvc.patterns where route_id = '1'
3111 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - Result size: 1
3111 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.NestedJDBCFetcher - Join values: [1]
3111 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - SQL: select * from pree_vlfdxmuecxqgctfqnntwvc.pattern_stops where pattern_id = '1'
3112 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - Result size: 2
3112 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.NestedJDBCFetcher - stops args: [limit]
3112 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.NestedJDBCFetcher - Join values: [4u6g, johv]
3112 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - SQL: select * from pree_vlfdxmuecxqgctfqnntwvc.stops where stop_id in ('4u6g', 'johv')
3113 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - Result size: 2
3113 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.NestedJDBCFetcher - Join values: [johv]
3113 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - SQL: select * from pree_vlfdxmuecxqgctfqnntwvc.pattern_stops where stop_id = 'johv'
3114 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - Result size: 1
3114 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.NestedJDBCFetcher - Join values: [1]
3114 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - SQL: select * from pree_vlfdxmuecxqgctfqnntwvc.patterns where pattern_id = '1'
3115 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - Result size: 1
3115 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.NestedJDBCFetcher - routes args: [limit]
3115 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.NestedJDBCFetcher - Join values: [1]
3115 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - SQL: select * from pree_vlfdxmuecxqgctfqnntwvc.routes where route_id = '1'
3115 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - Result size: 1
3115 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.NestedJDBCFetcher - Join values: [1]
3116 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - SQL: select * from pree_vlfdxmuecxqgctfqnntwvc.patterns where route_id = '1'
3116 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - Result size: 1
3116 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.NestedJDBCFetcher - Join values: [1]
3116 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - SQL: select * from pree_vlfdxmuecxqgctfqnntwvc.pattern_stops where pattern_id = '1'
3117 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - Result size: 2
3117 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.NestedJDBCFetcher - stops args: [limit]
3117 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.NestedJDBCFetcher - Join values: [4u6g, johv]
3117 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - SQL: select * from pree_vlfdxmuecxqgctfqnntwvc.stops where stop_id in ('4u6g', 'johv')
3118 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - Result size: 2
3118 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.NestedJDBCFetcher - Join values: [4u6g]
3118 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - SQL: select * from pree_vlfdxmuecxqgctfqnntwvc.pattern_stops where stop_id = '4u6g'
3119 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - Result size: 1
3119 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.NestedJDBCFetcher - Join values: [1]
3119 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - SQL: select * from pree_vlfdxmuecxqgctfqnntwvc.patterns where pattern_id = '1'
3119 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - Result size: 1
3119 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.NestedJDBCFetcher - routes args: [limit]
3119 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.NestedJDBCFetcher - Join values: [1]
3119 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - SQL: select * from pree_vlfdxmuecxqgctfqnntwvc.routes where route_id = '1'
3120 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - Result size: 1
3120 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.NestedJDBCFetcher - Join values: [1]
3120 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - SQL: select * from pree_vlfdxmuecxqgctfqnntwvc.patterns where route_id = '1'
3121 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - Result size: 1
3121 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.NestedJDBCFetcher - Join values: [1]
3121 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - SQL: select * from pree_vlfdxmuecxqgctfqnntwvc.pattern_stops where pattern_id = '1'
3121 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - Result size: 2
3121 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.NestedJDBCFetcher - stops args: [limit]
3121 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.NestedJDBCFetcher - Join values: [4u6g, johv]
3121 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - SQL: select * from pree_vlfdxmuecxqgctfqnntwvc.stops where stop_id in ('4u6g', 'johv')
3122 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - Result size: 2
3122 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.NestedJDBCFetcher - Join values: [johv]
3122 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - SQL: select * from pree_vlfdxmuecxqgctfqnntwvc.pattern_stops where stop_id = 'johv'
3123 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - Result size: 1
3123 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.NestedJDBCFetcher - Join values: [1]
3123 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - SQL: select * from pree_vlfdxmuecxqgctfqnntwvc.patterns where pattern_id = '1'
3123 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - Result size: 1
3123 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.NestedJDBCFetcher - routes args: [limit]
3123 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.NestedJDBCFetcher - Join values: [1]
3123 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - SQL: select * from pree_vlfdxmuecxqgctfqnntwvc.routes where route_id = '1'
3124 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - Result size: 1
3124 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.NestedJDBCFetcher - Join values: [1]
3124 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - SQL: select * from pree_vlfdxmuecxqgctfqnntwvc.patterns where route_id = '1'
3124 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - Result size: 1
3124 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.NestedJDBCFetcher - Join values: [1]
3125 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - SQL: select * from pree_vlfdxmuecxqgctfqnntwvc.pattern_stops where pattern_id = '1'
3125 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - Result size: 2
3125 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.NestedJDBCFetcher - stops args: [limit]
3125 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.NestedJDBCFetcher - Join values: [4u6g, johv]
3125 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - SQL: select * from pree_vlfdxmuecxqgctfqnntwvc.stops where stop_id in ('4u6g', 'johv')
3126 [Time-limited test] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - Result size: 2

A total of 41 SQL queries were made to the database in the above example.

After:

2854 [Time-limited test] INFO com.conveyal.gtfs.graphql.GTFSGraphQLTest - Query GraphQL with query: superNestedNoLimits.txt
3283 [Time-limited test] INFO graphql.schema.DataFetcher - SQL: select * from feeds where namespace = 'jmlh_jsohdsfkvjlgwulkbfnyne'
3292 [ForkJoinPool.commonPool-worker-1] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - SQL: select * from jmlh_jsohdsfkvjlgwulkbfnyne.routes
3294 [ForkJoinPool.commonPool-worker-1] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - Result size: 1
3299 [ForkJoinPool.commonPool-worker-1] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - SQL: select jmlh_jsohdsfkvjlgwulkbfnyne.stops.* from jmlh_jsohdsfkvjlgwulkbfnyne.patterns, jmlh_jsohdsfkvjlgwulkbfnyne.pattern_stops, jmlh_jsohdsfkvjlgwulkbfnyne.stops where jmlh_jsohdsfkvjlgwulkbfnyne.patterns.route_id = '1' and jmlh_jsohdsfkvjlgwulkbfnyne.patterns.pattern_id = jmlh_jsohdsfkvjlgwulkbfnyne.pattern_stops.pattern_id and jmlh_jsohdsfkvjlgwulkbfnyne.pattern_stops.stop_id = jmlh_jsohdsfkvjlgwulkbfnyne.stops.stop_id
3301 [ForkJoinPool.commonPool-worker-1] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - Result size: 2
3304 [ForkJoinPool.commonPool-worker-2] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - SQL: select jmlh_jsohdsfkvjlgwulkbfnyne.routes.*, jmlh_jsohdsfkvjlgwulkbfnyne.pattern_stops.stop_id as jmlh_jsohdsfkvjlgwulkbfnyne_pattern_stops_stop_id_alias from jmlh_jsohdsfkvjlgwulkbfnyne.routes, jmlh_jsohdsfkvjlgwulkbfnyne.patterns, jmlh_jsohdsfkvjlgwulkbfnyne.pattern_stops where jmlh_jsohdsfkvjlgwulkbfnyne.pattern_stops.pattern_id = jmlh_jsohdsfkvjlgwulkbfnyne.patterns.pattern_id and jmlh_jsohdsfkvjlgwulkbfnyne.patterns.route_id = jmlh_jsohdsfkvjlgwulkbfnyne.routes.route_id and jmlh_jsohdsfkvjlgwulkbfnyne.pattern_stops.stop_id IN ('4u6g', 'johv')
3305 [ForkJoinPool.commonPool-worker-2] INFO com.conveyal.gtfs.graphql.fetchers.JDBCFetcher - Result size: 2

After the improvements, the same GraphQL query required only 4 SQL queries. Let's break down what happened.

  • The first query for the feeds remains in both tests: select * from feeds where namespace = 'jmlh_jsohdsfkvjlgwulkbfnyne'
  • The next query to get all the routes of the feeds remains in both tests: select * from pree_vlfdxmuecxqgctfqnntwvc.routes
  • The queries to fetch the stops of the routes are combined into one query in the test of this PR:
    select jmlh_jsohdsfkvjlgwulkbfnyne.stops.* from jmlh_jsohdsfkvjlgwulkbfnyne.patterns, jmlh_jsohdsfkvjlgwulkbfnyne.pattern_stops, jmlh_jsohdsfkvjlgwulkbfnyne.stops where jmlh_jsohdsfkvjlgwulkbfnyne.patterns.route_id = '1' and jmlh_jsohdsfkvjlgwulkbfnyne.patterns.pattern_id = jmlh_jsohdsfkvjlgwulkbfnyne.pattern_stops.pattern_id and jmlh_jsohdsfkvjlgwulkbfnyne.pattern_stops.stop_id = jmlh_jsohdsfkvjlgwulkbfnyne.stops.stop_id.
    This is in contrast to the 3 queries required by the previous code: select * from pree_vlfdxmuecxqgctfqnntwvc.patterns where route_id = '1', select * from pree_vlfdxmuecxqgctfqnntwvc.pattern_stops where pattern_id = '1' and select * from pree_vlfdxmuecxqgctfqnntwvc.stops where stop_id in ('4u6g', 'johv').
  • The next set of queries needs to fetch all of the routes associated with each stop. The previous version needed to make a unique query for the routes of each stop. So the previous version made 6 queries to accomplish this. Doing this in only 6 queries is likely the best possible case because the test feed contains only 1 route and 2 stops. For every stop 3 queries needed to be made to get the route. For larger feeds, the number of queries could easily reach into the thousands for this step. Here are the 6 queries: select * from pree_vlfdxmuecxqgctfqnntwvc.pattern_stops where stop_id = '4u6g', select * from pree_vlfdxmuecxqgctfqnntwvc.patterns where pattern_id = '1', select * from pree_vlfdxmuecxqgctfqnntwvc.routes where route_id = '1', select * from pree_vlfdxmuecxqgctfqnntwvc.pattern_stops where stop_id = 'johv', select * from pree_vlfdxmuecxqgctfqnntwvc.patterns where pattern_id = '1' and select * from pree_vlfdxmuecxqgctfqnntwvc.routes where route_id = '1'. This PR turns these 6 queries into 1 query: select jmlh_jsohdsfkvjlgwulkbfnyne.routes.*, jmlh_jsohdsfkvjlgwulkbfnyne.pattern_stops.stop_id as jmlh_jsohdsfkvjlgwulkbfnyne_pattern_stops_stop_id_alias from jmlh_jsohdsfkvjlgwulkbfnyne.routes, jmlh_jsohdsfkvjlgwulkbfnyne.patterns, jmlh_jsohdsfkvjlgwulkbfnyne.pattern_stops where jmlh_jsohdsfkvjlgwulkbfnyne.pattern_stops.pattern_id = jmlh_jsohdsfkvjlgwulkbfnyne.patterns.pattern_id and jmlh_jsohdsfkvjlgwulkbfnyne.patterns.route_id = jmlh_jsohdsfkvjlgwulkbfnyne.routes.route_id and jmlh_jsohdsfkvjlgwulkbfnyne.pattern_stops.stop_id IN ('4u6g', 'johv'). The first thing here is that the NestedJDBCFetcher refactor is making it possible to make 1 query to select the routes of the stops. The second thing happening is that the queries created by the NestedJDBCFetcher are being combined together into one single query that has an IN clause during the batch execution of queries in the dataloader.
  • And that was the 4th and final query made during this test in this PR. The next round of queries now fetches redundant data. It needs to get the stops of the routes. The resulting query that is created in this PR to fetch that data is select jmlh_jsohdsfkvjlgwulkbfnyne.stops.* from jmlh_jsohdsfkvjlgwulkbfnyne.patterns, jmlh_jsohdsfkvjlgwulkbfnyne.pattern_stops, jmlh_jsohdsfkvjlgwulkbfnyne.stops where jmlh_jsohdsfkvjlgwulkbfnyne.patterns.route_id = '1' and jmlh_jsohdsfkvjlgwulkbfnyne.patterns.pattern_id = jmlh_jsohdsfkvjlgwulkbfnyne.pattern_stops.pattern_id and jmlh_jsohdsfkvjlgwulkbfnyne.pattern_stops.stop_id = jmlh_jsohdsfkvjlgwulkbfnyne.stops.stop_id. However, that exact query has already been made. Therefore, the result of that query is extracted from the Dataloader cache. In the previous code, all of this data needed to get fetched all over again from the SQL database. The same thing continues to happen repeatedly which means fetching results from the dataloader cache in this PR or making more SQL queries in the old code.

@abyrd
Copy link
Member

abyrd commented Oct 29, 2018

Just discussed and reviewed this PR a bit more with @evansiroky on a call. He checked out some before/after SQL queries generated by regular usage of data tools.

We've got three optimizations here: NestedJDBCFetcher, the combining of multiple similar queries to the same table, and the cacheing of SQL query responses cascading from a single GraphQL query. We can see cases where the optimized NestedJDBCFetcher is producing fewer, simpler SQL queries using joins, e.g. when fetching all the stops in a pattern. However there don't seem to be many cases that meet the requirements for SQL combining and response caching.

The original ticket #125 and the latter two optimizations are based on the idea that in the limit, it's theoretically better to do less SQL queries (assumed to be slower and to have a certain amount of overhead per query). However if that code adds complexity there is a trade-off, and we need to demonstrate that in practice it yields a perceptible increase in performance that is worth the increase in complexity. This is analogous to all the cases where some algorithm has better asymptotic complexity but a simpler algorithms performs just as well on real use cases (e.g. Fibonacci heaps or binary searches on small lists).

We came to the conclusion that in the Jmeter performance tests we should replace a contrived, heavily nested test query with some queries likely to arise in the data tools / gtfs editor context that are expected to benefit from the optimizations, e.g. fetching all the shape points for all patterns in the feed. The contrived query was drowning out most of the other results because it was so much slower - with these more realistic queries we can see if the impact of the optimizations is actually of a magnitude that is perceptible.

The contrived query is still great as a unit test case, for verification of the results against a snapshot to ensure that the code does what it's intended to do.

@evansiroky
Copy link
Contributor Author

After some thorough testing with some more commonly seen queries in the datatools application, it has been conclusively been shown that the work done here actually made performance worse. The above-listed test case with an extremely nested query is rarely if never found. Two more tests with queries taken directly from datatools were analyzed for their performance with TriMet's latest GTFS. Shown below is the performance of the codebase before this update, but with a logging improvement unrelated to GraphQL querying:

screen shot 2018-10-29 at 10 28 53 pm

And here are the results with the updates from this PR:

screen shot 2018-10-30 at 8 16 14 am

The amount of extra overhead of batching queries made things slower than simply making many more individual queries. Therefore, this PR is being closed.

@evansiroky evansiroky closed this Nov 8, 2018
evansiroky added a commit that referenced this pull request Nov 20, 2018
Take some reasonable changes that came out of #144 and add them here
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants