Switch to Storage write API in BigQuery #15158
I have the change already. It doesn't appear to solve #14981 though, so I'll submit two PRs - one to switch to the Storage Write API and the other to fix how projection pushdown works for empty projections.
Finally! Great news @ebyhr
Also, I'm documenting some things here that I looked at when creating the PR.

The BigQuery Storage Write API can operate in multiple modes.

Default stream
All writes are visible immediately - this means that for each successful append the rows are already queryable.

Application created streams
Streams created explicitly for a write session; when the appended rows become visible depends on the stream type (see Modes below).

Limits
The max open-stream and stream-creation limits can be handled by creating a limited number of streams on the coordinator and sending them to the workers instead of one per worker (streams can be shared by multiple workers). This implicitly caps each single write operation to a max of 1 TB of data written.

Modes
Here's how the stream types behave: COMMITTED makes records visible as soon as each append succeeds, PENDING buffers records invisibly until the stream is finalized and committed, and BUFFERED makes records visible only when the buffer is explicitly flushed. PENDING has the benefit that the entire INSERT is atomic.

Conclusion
Using PENDING-type application-created streams looks like the right choice.
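To make the PENDING flow concrete, here is a minimal sketch using the google-cloud-bigquerystorage Java client; the project/dataset/table names and the single "col" STRING column are placeholders for illustration, not the actual connector code:

```java
import com.google.cloud.bigquery.storage.v1.BatchCommitWriteStreamsRequest;
import com.google.cloud.bigquery.storage.v1.BigQueryWriteClient;
import com.google.cloud.bigquery.storage.v1.CreateWriteStreamRequest;
import com.google.cloud.bigquery.storage.v1.JsonStreamWriter;
import com.google.cloud.bigquery.storage.v1.TableName;
import com.google.cloud.bigquery.storage.v1.WriteStream;
import org.json.JSONArray;
import org.json.JSONObject;

public class PendingStreamSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder identifiers - substitute real ones.
        TableName parent = TableName.of("my-project", "my_dataset", "my_table");

        try (BigQueryWriteClient client = BigQueryWriteClient.create()) {
            // 1. Create a PENDING-type stream: appended rows stay invisible for now.
            WriteStream stream = client.createWriteStream(
                    CreateWriteStreamRequest.newBuilder()
                            .setParent(parent.toString())
                            .setWriteStream(WriteStream.newBuilder()
                                    .setType(WriteStream.Type.PENDING)
                                    .build())
                            .build());

            // 2. Append rows; the stream carries the table schema it was created with.
            try (JsonStreamWriter writer = JsonStreamWriter
                    .newBuilder(stream.getName(), stream.getTableSchema())
                    .build()) {
                JSONArray rows = new JSONArray()
                        .put(new JSONObject().put("col", "value")); // placeholder row
                writer.append(rows).get(); // wait for the append to be acknowledged
            }

            // 3. Finalize: no further appends are accepted on this stream.
            client.finalizeWriteStream(stream.getName());

            // 4. Commit all finalized streams in one atomic operation - this is
            //    what makes the whole INSERT atomic in PENDING mode.
            client.batchCommitWriteStreams(
                    BatchCommitWriteStreamsRequest.newBuilder()
                            .setParent(parent.toString())
                            .addWriteStreams(stream.getName())
                            .build());
        }
    }
}
```

Note that batchCommitWriteStreams accepts many stream names at once, which would map nicely onto the idea above of one stream per worker committed together by the coordinator.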
The Spark BigQuery connector is also using PENDING mode. When a stream writes data in PENDING mode, the data is stored in an internal buffer that is not visible to the outside world; once the stream is finalized and committed, the buffer becomes visible and the data is available to read. You can also check the estimated number of bytes held in the buffer; eventually the buffer is merged into the actual BigQuery storage:

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.StandardTableDefinition;
import com.google.cloud.bigquery.Table;
import com.google.cloud.bigquery.TableId;

// Fetch only the streaming-buffer metadata for the table.
BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
TableId tableId = TableId.of(projectId, datasetName, tableName);
Table table = bigquery.getTable(
        tableId, BigQuery.TableOption.fields(BigQuery.TableField.STREAMING_BUFFER));
StandardTableDefinition standardTableDefinition = table.getDefinition();
StandardTableDefinition.StreamingBuffer streamingBuffer =
        standardTableDefinition.getStreamingBuffer();
// Estimated size, in bytes, of the rows still sitting in the buffer.
Long estimatedBytes = streamingBuffer.getEstimatedBytes();
```
@hashhar Can you send a PR when you have time? I guess switching to the Storage API would improve CI speed.
Hi! I was looking to add support for this as well. I'm glad I stumbled on this! @hashhar it would be great if you could push what you have. Thanks 🙏
Poking again on this issue 😄 hoping to get some traction. We are starting to rely heavily on the BigQuery connector, and because it relies on the Streaming API we are seeing not only poor performance but also rising costs. The Streaming API is not cheap, and if you do a lot of ingestion it really becomes a problem. I'll most likely start tackling this next week. If there's something shareable on your side, I would appreciate it if you could share it 😄 We do have a lot of load, and I'm happy to put it through its paces.
Partially done via #18897 by using the default stream. Parts that remain are to see if we can switch to application-created PENDING streams so that INSERT becomes atomic.
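For contrast with the PENDING sketch earlier in the thread, here is a minimal sketch of an append through the default stream (which, if I read the comment above right, is what the partial change relies on); the names and the single-column schema are placeholders:

```java
import com.google.cloud.bigquery.storage.v1.JsonStreamWriter;
import com.google.cloud.bigquery.storage.v1.TableFieldSchema;
import com.google.cloud.bigquery.storage.v1.TableName;
import com.google.cloud.bigquery.storage.v1.TableSchema;
import org.json.JSONArray;
import org.json.JSONObject;

public class DefaultStreamSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder identifiers - substitute real ones.
        TableName table = TableName.of("my-project", "my_dataset", "my_table");

        // Schema of the placeholder table: a single STRING column named "col".
        TableSchema schema = TableSchema.newBuilder()
                .addFields(TableFieldSchema.newBuilder()
                        .setName("col")
                        .setType(TableFieldSchema.Type.STRING)
                        .build())
                .build();

        // Passing a table name (rather than a stream name) targets the table's
        // default stream: rows become visible as soon as each append succeeds,
        // so there is no finalize/commit step.
        try (JsonStreamWriter writer =
                JsonStreamWriter.newBuilder(table.toString(), schema).build()) {
            JSONArray rows = new JSONArray().put(new JSONObject().put("col", "value"));
            writer.append(rows).get(); // wait for acknowledgement
        }
    }
}
```

Because default-stream appends are committed independently as each one is acknowledged, a failed INSERT can leave part of its rows behind - hence the remaining interest in application-created streams.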
A PR for
The lag issue between the Storage Read and Write APIs is finally resolved according to https://issuetracker.google.com/issues/200589932. We should use the Storage Write API for better performance.