-
Notifications
You must be signed in to change notification settings - Fork 38
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Graphql update #144
Graphql update #144
Conversation
- initialize dataloader so that it only caches during a single request - also add test for graphql queries that are redundantly nested
Codecov Report
@@ Coverage Diff @@
## dev #144 +/- ##
============================================
+ Coverage 48.48% 54.24% +5.75%
- Complexity 596 758 +162
============================================
Files 142 143 +1
Lines 7039 7850 +811
Branches 817 974 +157
============================================
+ Hits 3413 4258 +845
+ Misses 3321 3232 -89
- Partials 305 360 +55
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general I think it's great that you have added DataLoader and cacheing to our GraphQL implementation. I do think the getResults
method signature needs to be refined or refactored for a little more clarity if possible (in addition to a few more comments here and there).
sqlBuilder.append("select *"); | ||
return getResults( | ||
getJdbcQueryDataLoaderFromContext(environment), | ||
getDataSourceFromContext(environment), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since both the data loader and data source are coming from the environment
variable, can't we just pass that to getResults
and use these helper methods within that method?
@@ -75,54 +83,71 @@ public static GraphQLFieldDefinition field (String tableName) { | |||
// FIXME: NestedJDBCFetcher may need to be refactored so that it avoids conventions of JDBCFetcher (like the | |||
// implied limit of 50 records). For now, the autoLimit field has been added to JDBCFetcher, so that certain | |||
// fetchers (like the nested ones used solely for joins here) will not apply the limit by default. | |||
Map<String, Object> arguments = null; | |||
Map<String, Object> graphQLQueryArguemnts = environment.getArguments();; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo: graphQLQueryArguemnts
|
||
// Store each iteration's fetch results here. | ||
List<Map<String, Object>> fetchResults = null; | ||
JDBCFetcher lastFetcher = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This lastFetcher needs a comment about what it does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find it helpful to distinguish between "last" and "previous" in variable names. Is this the last (as in final, end of a list) one, or the one seen in the previous loop iteration?
return fetchResults; | ||
// This piece of code will never be reached because of how things get returned above. | ||
// But it's here to make java happy. | ||
return new CompletableFuture<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems unnecessary. You could just move the return statement that happens in the last iteration to this block. You would need to instantiate the string builder at the beginning of this block. This final return could also benefit from a description of what it ultimately contains.
preparedStatementParameters, | ||
whereConditions, | ||
fromTables, | ||
sqlStatementStringBuilder |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find it strange that this string builder argument is expected to contain the first clause of the SQL query (e.g., select *
or select abc.table.*
). This isn't documented anywhere and I feel like it will just cause confusion. Couldn't we pass something a little more transparent in here?
Addresses some PR review comments in #144
All comments should be addressed and this PR should be ready to be merged. |
Thanks, @evansiroky. Now we just need @abyrd's review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have to admit I'm kind of losing the plot on how all of this works, but I get the idea that it's cacheing responses for certain queries, then reusing the responses as needed, and somehow accumulating requests and executing them asynchronously. It would be reassuring to have some high-level summary of how this all works within the source code comments. Or maybe that already exists and I didn't see it?
Anyway it's great to see tests that demonstrate that this works as intended. I do get the impression that some things are being done asynchronously, which always raises questions about whether it will behave well with larger numbers of interleaved queries.
As a sanity check and just for explanatory purposes, it would be very interesting to have a side by side comparison of the series of SQL requests generated by a few deeply nested queries. Would it be possible to log those and include them in the PR or issue comments?
* @param dataSource The dataSource to use in connecting to a database to make sql queries. | ||
* @param query The graphQL query | ||
* @param variables The variables to apply to the graphQL query | ||
* @return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Map<String, Object> return type is very generic, so probably merits a little explanation. I guess this is returning the objects resulting from the GraphQL request. The Javadoc mentions creating the graphQL runner so my first impression was that it would return that runner. Probably worth clarifying that it creates the runner and actually submits it for execution (to an asynchronous mechanism, but blocking?) then returns the results once complete.
public class GTFSGraphQL { | ||
|
||
private static DataSource dataSource; | ||
/** | ||
* Execute a graphQL request. This method creates a new graphQL runner that is able to cache and batch sql queries |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you clarify in the comment which particular part of the following code makes it capable of batching the requests?
registry.register("jdbcfetcher", jdbcQueryDataLoader); | ||
|
||
return GraphQL.newGraphQL(GraphQLGtfsSchema.feedBasedSchema) | ||
// this instrumentation implementation will dispatch all the dataloaders as each level fo the graphql query |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo: 'fo' instead of 'of'. Is "dispatch all the dataloaders" common terminology for this library or for GraphQL? I don't really get what this means.
return GraphQL.newGraphQL(GraphQLGtfsSchema.feedBasedSchema) | ||
// this instrumentation implementation will dispatch all the dataloaders as each level fo the graphql query | ||
// is executed and hence make batched objects available to the query and the associated DataFetchers. This | ||
// must be created new each time to avoid caching load results in future requests. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Javadoc for the method mentions that we want to cache results, then here it says that we don't want to "cache load results in future requests". It's not clear to me what this means. Do you mean that we want a fresh, empty cache for each request, i.e. that we don't want cached results to be reused by future (or concurrent) requests? Is this because we expect the fetched values to change between requests? Please clarify in the comments/javadoc.
} catch (SQLException ex) { | ||
throw new RuntimeException(ex); | ||
/** | ||
* Helper method to return the DataSource from the DataFetchingEnvironment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This Javadoc could be considered trivial, as it refers only to the return and parameter types. Maybe expand on how it's used rather than the types it handles - it seems that this is used to retrieve an object representing the SQL database, to allow access to it from within a GraphQL query.
* routes that serve a specific stop, starting with a top-level stop type, we can nest joins from stop ABC -> pattern | ||
* stops -> patterns -> routes (see below example implementation for more details). | ||
* This fetcher creates one big query by joining tables in the order specified in the initial definition. The data | ||
* returned will only consist of the columns from the final table definition. All other tables are assumed to be |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is "the final table definition"? Its the word "definition" that's throwing me off... maybe it just means the last table in the chain of joins?
*/ | ||
public class NestedJDBCFetcher implements DataFetcher<List<Map<String, Object>>> { | ||
public class NestedJDBCFetcher implements DataFetcher<Object> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again I'm hesitant to lose the type information. Using Object
defeats the purpose of generic types. Are there cases where we're not returning lists of maps?
|
||
// Store each iteration's fetch results here. | ||
List<Map<String, Object>> fetchResults = null; | ||
JDBCFetcher lastFetcher = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find it helpful to distinguish between "last" and "previous" in variable names. Is this the last (as in final, end of a list) one, or the one seen in the previous loop iteration?
I've added in the ability to combine certain sql queries which has sped up things considerably. Posted above are the before and after results of running the query test plan in jmeter. The purple line is for a request of the following graphql query run against the TriMet feed: query nestedQuery($namespace: String) {
feed(namespace: $namespace) {
routes(limit: 1) {
patterns(limit: -1) {
route {
patterns(limit: -1) {
route {
route_id
stops(limit: -1) {
stop_id
}
}
}
route_id
stops(limit: -1) {
stop_id
}
}
}
route_id
stops(limit: -1) {
stop_id
}
}
}
} |
Just discussed with @evansiroky and @landonreed. Edited @evansiroky's comment above to clarify that the timing results are from querying against the TriMet feed. He also clarified that most of the remaining response time after the optimization is from serializing the results. The optimization seems to have reduced the SQL query time significantly, and most remaining time is spent communicating the large amount of results back to the caller. I still think it would be very interesting to include in the PR comments an example of a real-world query that could appear under the GTFS editor, with the SQL queries generated before and after the optimization (and perhaps timings). |
What is this all about anyways?This PR does 3 things to speed up GraphQL queries: implementing java-dataloader as part of making the GraphQL queries, refactoring the NestedJDBCFetcher to make less queries and combining certain SQL queries. Implementation of the DataloaderThe Dataloader is implemented to handle SQL queries. It does 2 beneficial things. First, it caches the results of queries. Therefore, if a SQL query that has been ran before is ran again in the exact same form, the results of that SQL query will be returned from a cache instead of by making a query to Postgres. Second, it batches queries together and presents them all at once which affords the opportunity to possibly combine the queries together. NestedJDBCFetcher refactorPreviously, the NestedJDBCFetcher was making many intermediate queries when only 1 query was needed that used joins. For example, to fetch all stops belonging to a route, the NestedJDBCFetcher previously made extra intermediate queries to the patterns and then pattern_stops table and then finally the stops table. Here is an example of the progression:
The refactor of the NestedJDBCFetcher builds up a query with joins that accomplish the same result in one query:
Combining certain SQL queriesAlthough lots of SQL statements are performed, one of the most common and combineable query is a basic
become
Obtuse example that highlights benefits of the above 3 improvementsThis PR adds the query ($namespace: String) {
feed(namespace: $namespace) {
feed_version
routes(limit: -1) {
route_id
stops(limit: -1) {
routes(limit: -1) {
route_id
stops(limit: -1) {
routes(limit: -1) {
route_id
stops(limit: -1) {
stop_id
}
}
stop_id
}
}
stop_id
}
}
}
} As can be seen, the stops of a route and the routes of the stops are fetched over and over. Shown below are the logs of this test before this PR was implemented:
A total of 41 SQL queries were made to the database in the above example. After:
After the improvements, the same GraphQL query required only 4 SQL queries. Let's break down what happened.
|
Just discussed and reviewed this PR a bit more with @evansiroky on a call. He checked out some before/after SQL queries generated by regular usage of data tools. We've got three optimizations here: NestedJDBCFetcher, the combining of multiple similar queries to the same table, and the cacheing of SQL query responses cascading from a single GraphQL query. We can see cases where the optimized NestedJDBCFetcher is producing fewer, simpler SQL queries using joins, e.g. when fetching all the stops in a pattern. However there don't seem to be many cases that meet the requirements for SQL combining and response caching. The original ticket #125 and the latter two optimizations are based on the idea that in the limit, it's theoretically better to do less SQL queries (assumed to be slower and to have a certain amount of overhead per query). However if that code adds complexity there is a trade-off, and we need to demonstrate that in practice it yields a perceptible increase in performance that is worth the increase in complexity. This is analogous to all the cases where some algorithm has better asymptotic complexity but a simpler algorithms performs just as well on real use cases (e.g. Fibonacci heaps or binary searches on small lists). We came to the conclusion that in the Jmeter performance tests we should replace a contrived, heavily nested test query with some queries likely to arise in the data tools / gtfs editor context that are expected to benefit from the optimizations, e.g. fetching all the shape points for all patterns in the feed. The contrived query was drowning out most of the other results because it was so much slower - with these more realistic queries we can see if the impact of the optimizations is actually of a magnitude that is perceptible. The contrived query is still great as a unit test case, for verification of the results against a snapshot to ensure that the code does what it's intended to do. |
After some thorough testing with some more commonly seen queries in the datatools application, it has been conclusively been shown that the work done here actually made performance worse. The above-listed test case with an extremely nested query is rarely if never found. Two more tests with queries taken directly from datatools were analyzed for their performance with TriMet's latest GTFS. Shown below is the performance of the codebase before this update, but with a logging improvement unrelated to GraphQL querying: And here are the results with the updates from this PR: The amount of extra overhead of batching queries made things slower than simply making many more individual queries. Therefore, this PR is being closed. |
Take some reasonable changes that came out of #144 and add them here
This PR is in response to issue #125. Its objective is to simplify and reduce the number of SQL queries generated to handle a single GraphQL query (i.e. this is an optimization which operates within a single GraphQL query; it does not cache or optimize across multiple GraphQL queries).