Switch to Storage write API in BigQuery #15158
I have the change already. It doesn't appear to solve #14981 though, so I'll submit two PRs - one to switch to the Storage Write API and the other to fix how projection pushdown works for empty projections.
Finally! Great news @ebyhr
Also, I'm documenting some things here that I looked at when creating the PR.

The BigQuery Storage Write API can operate in multiple modes.

Default stream
All writes are visible immediately - this means that for each successful append the rows are already queryable.

Application created streams
Streams created explicitly for a write session; when the appended rows become visible depends on the stream type (see Modes below).

Limits
The max open-stream and stream-creation limits can be handled by creating a limited number of streams on the coordinator and sending them to the workers instead of one per worker (streams can be shared by multiple workers). This implicitly caps each single write operation to a max of 1 TB of data written.

Modes
Here's how the stream types behave: COMMITTED makes records visible as soon as each append succeeds, PENDING buffers records invisibly until the stream is finalized and committed, and BUFFERED makes records visible only when the buffer is explicitly flushed. PENDING has the benefit that the entire INSERT is atomic.

Conclusion
Using PENDING-type application-created streams looks like the right choice.
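To make the PENDING flow concrete, here is a minimal sketch using the google-cloud-bigquerystorage Java client; the project/dataset/table names and the single "col" STRING column are placeholders for illustration, not the actual connector code:

```java
import com.google.cloud.bigquery.storage.v1.BatchCommitWriteStreamsRequest;
import com.google.cloud.bigquery.storage.v1.BigQueryWriteClient;
import com.google.cloud.bigquery.storage.v1.CreateWriteStreamRequest;
import com.google.cloud.bigquery.storage.v1.JsonStreamWriter;
import com.google.cloud.bigquery.storage.v1.TableName;
import com.google.cloud.bigquery.storage.v1.WriteStream;
import org.json.JSONArray;
import org.json.JSONObject;

public class PendingStreamSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder identifiers - substitute real ones.
        TableName parent = TableName.of("my-project", "my_dataset", "my_table");

        try (BigQueryWriteClient client = BigQueryWriteClient.create()) {
            // 1. Create a PENDING-type stream: appended rows stay invisible for now.
            WriteStream stream = client.createWriteStream(
                    CreateWriteStreamRequest.newBuilder()
                            .setParent(parent.toString())
                            .setWriteStream(WriteStream.newBuilder()
                                    .setType(WriteStream.Type.PENDING)
                                    .build())
                            .build());

            // 2. Append rows; the stream carries the table schema it was created with.
            try (JsonStreamWriter writer = JsonStreamWriter
                    .newBuilder(stream.getName(), stream.getTableSchema())
                    .build()) {
                JSONArray rows = new JSONArray()
                        .put(new JSONObject().put("col", "value")); // placeholder row
                writer.append(rows).get(); // wait for the append to be acknowledged
            }

            // 3. Finalize: no further appends are accepted on this stream.
            client.finalizeWriteStream(stream.getName());

            // 4. Commit all finalized streams in one atomic operation - this is
            //    what makes the whole INSERT atomic in PENDING mode.
            client.batchCommitWriteStreams(
                    BatchCommitWriteStreamsRequest.newBuilder()
                            .setParent(parent.toString())
                            .addWriteStreams(stream.getName())
                            .build());
        }
    }
}
```

Note that batchCommitWriteStreams accepts many stream names at once, which would map nicely onto the idea above of one stream per worker committed together by the coordinator.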
The Spark BigQuery connector is also using PENDING mode. When a stream writes data in PENDING mode, the data is stored in an internal buffer that is not visible to the outside world; once the stream is finalized and committed, the buffer becomes visible and the data is available to read. You can also check the estimated number of bytes held in the buffer; eventually the buffer is merged into the actual BigQuery storage:

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.StandardTableDefinition;
import com.google.cloud.bigquery.Table;
import com.google.cloud.bigquery.TableId;

// Fetch only the streaming-buffer metadata for the table.
BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
TableId tableId = TableId.of(projectId, datasetName, tableName);
Table table = bigquery.getTable(
        tableId, BigQuery.TableOption.fields(BigQuery.TableField.STREAMING_BUFFER));
StandardTableDefinition standardTableDefinition = table.getDefinition();
StandardTableDefinition.StreamingBuffer streamingBuffer =
        standardTableDefinition.getStreamingBuffer();
// Estimated size, in bytes, of the rows still sitting in the buffer.
Long estimatedBytes = streamingBuffer.getEstimatedBytes();
```
@hashhar Can you send a PR when you have time? I guess switching to the Storage API would improve CI speed.
Hi! I was looking to add support for this as well. I'm glad I stumbled on this! @hashhar it would be great if you could push what you have. Thanks 🙏
Poking again on this issue 😄 hoping to get some traction. We are starting to rely heavily on the BigQuery connector, and because it relies on the Streaming API we are seeing not only poor performance but also rising costs. The Streaming API is not cheap, and if you do a lot of ingestion it really becomes a problem. I'll most likely start tackling this next week. If there's something shareable on your side, I would appreciate it if you could share it 😄 We do have a lot of load, and I'm happy to put it through its paces.
Partially done via #18897 by using the default stream. Parts that remain are to see if we can switch to application-created PENDING streams so that INSERT becomes atomic.
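For contrast with the PENDING sketch earlier in the thread, here is a minimal sketch of an append through the default stream (which, if I read the comment above right, is what the partial change relies on); the names and the single-column schema are placeholders:

```java
import com.google.cloud.bigquery.storage.v1.JsonStreamWriter;
import com.google.cloud.bigquery.storage.v1.TableFieldSchema;
import com.google.cloud.bigquery.storage.v1.TableName;
import com.google.cloud.bigquery.storage.v1.TableSchema;
import org.json.JSONArray;
import org.json.JSONObject;

public class DefaultStreamSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder identifiers - substitute real ones.
        TableName table = TableName.of("my-project", "my_dataset", "my_table");

        // Schema of the placeholder table: a single STRING column named "col".
        TableSchema schema = TableSchema.newBuilder()
                .addFields(TableFieldSchema.newBuilder()
                        .setName("col")
                        .setType(TableFieldSchema.Type.STRING)
                        .build())
                .build();

        // Passing a table name (rather than a stream name) targets the table's
        // default stream: rows become visible as soon as each append succeeds,
        // so there is no finalize/commit step.
        try (JsonStreamWriter writer =
                JsonStreamWriter.newBuilder(table.toString(), schema).build()) {
            JSONArray rows = new JSONArray().put(new JSONObject().put("col", "value"));
            writer.append(rows).get(); // wait for acknowledgement
        }
    }
}
```

Because default-stream appends are committed independently as each one is acknowledged, a failed INSERT can leave part of its rows behind - hence the remaining interest in application-created streams.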
A PR for
The lag issue between the Storage Read and Write APIs is finally resolved according to https://issuetracker.google.com/issues/200589932. We should use the Storage Write API for better performance.