GCS #18

Open
CMBCKSRL opened this issue May 24, 2023 · 39 comments

@CMBCKSRL commented May 24, 2023

I am trying to use parquet_s3_fdw to connect to my GCS bucket and extract data from parquet files, but it seems to be impossible (or I've made a mistake in my code).

Here is what I do.

First, I create the extension:
CREATE EXTENSION parquet_s3_fdw;

Then I create the server:
CREATE SERVER parquet_s3_srv FOREIGN DATA WRAPPER parquet_s3_fdw OPTIONS (region 'us-west1');

My GCS bucket's region is us-west1 (Oregon), but I also tried us-west2.

Afterwards, I create the user mapping:
CREATE USER MAPPING FOR CURRENT_USER SERVER parquet_s3_srv OPTIONS (user '<access_key>', password '<secret_key>');

I don't think there is a problem with these keys, because I was able to access my bucket from ClickHouse with them.

Finally, I create the foreign table:

CREATE FOREIGN TABLE natality_parquet (
  source_year TEXT,
  year TEXT,
  month TEXT,
  day TEXT,
  wday TEXT,
  state TEXT,
  is_male TEXT,
  child_race TEXT,
  weight_pounds TEXT,
  plurality TEXT,
  apgar_1min TEXT,
  apgar_5min TEXT,
  mother_residence_state TEXT,
  mother_race TEXT,
  mother_age TEXT,
  gestation_weeks TEXT,
  lmp TEXT,
  mother_married TEXT,
  mother_birth_state TEXT,
  cigarette_use TEXT,
  cigarettes_per_day TEXT,
  alcohol_use TEXT,
  drinks_per_week TEXT,
  weight_gain_pounds TEXT,
  born_alive_alive TEXT,
  born_alive_dead TEXT,
  born_dead TEXT,
  ever_born TEXT,
  father_race TEXT,
  father_age TEXT,
  record_weight TEXT
) SERVER parquet_s3_srv
OPTIONS (
  filename 's3://example_bucket_natality2/000000000000.parquet'
);

But when I query this foreign table, I get this error:
select * from natality_parquet limit 5;

SQL Error [XX000]: ERROR: parquet_s3_fdw: failed to exctract row groups from Parquet file: failed to open Parquet file HeadObject failed

Is it actually possible to access GCS via parquet_s3_fdw? If so, could you please point out where my code is mistaken?

@hrkuma commented Jun 26, 2023

Hi,

> Is it actually possible to access GCS via parquet_s3_fdw? If so, could you please point out where my code is mistaken?

We have not actually tried GCS with this FDW, so I'm sorry we cannot provide useful information at the moment.

@wuputah commented Jun 29, 2023

Is it possible to use the endpoint option to connect to non-S3 services? Perhaps you need to enable use_minio, but if the API is the same, it may work. Something like the untested sketch below.
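A minimal sketch of that idea; the endpoint value is an assumption (GCS's S3-compatible XML API host), and use_minio is used here only because it is the code path that honors endpoint:

```sql
-- Untested sketch: point the server at an S3-compatible endpoint.
CREATE SERVER gcs_srv FOREIGN DATA WRAPPER parquet_s3_fdw
    OPTIONS (use_minio 'true', endpoint 'storage.googleapis.com');
```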

@hrkuma commented Jun 30, 2023

We could not find definitive information that the AWS SDK for C++ works with GCS, so with the current AWS-SDK-based implementation I cannot say whether it will work or not.
As for the use_minio option: the scheme is hard-coded to Aws::Http::Scheme::HTTP in parquet_s3_fdw_connection.cpp, so we should probably change it to HTTPS, at least for GCS.

@mausch commented Jul 31, 2023

FWIW I'm getting the same error with a parquet file on AWS S3, so this is not just about GCS.
A different parquet file in the same S3 bucket works fine, though.

@ZhiXingHeYiApple commented Oct 19, 2023

FWIW I'm getting the same error with a parquet file on Ali OSS. I took a quick look at the source code and suspect that this piece of code, in parquet_s3_fdw_connection.cpp, is the problem:

```cpp
if (use_minio)
{
	const Aws::String defaultEndpoint = "127.0.0.1:9000";
	clientConfig.scheme = Aws::Http::Scheme::HTTP;
	clientConfig.endpointOverride = endpoint ? (Aws::String) endpoint : defaultEndpoint;
	s3_client = new Aws::S3::S3Client(cred, clientConfig,
			Aws::Client::AWSAuthV4Signer::PayloadSigningPolicy::Never, false);
}
else
{
	const Aws::String defaultRegion = "ap-northeast-1";
	clientConfig.scheme = Aws::Http::Scheme::HTTPS;
	clientConfig.region = awsRegion ? (Aws::String) awsRegion : defaultRegion;
	s3_client = new Aws::S3::S3Client(cred, clientConfig);
}
```

I think the else block is missing the clientConfig.endpointOverride assignment. The corrected code might look like this:

```cpp
else
{
	const Aws::String defaultRegion = "ap-northeast-1";
	clientConfig.scheme = Aws::Http::Scheme::HTTPS;
	clientConfig.region = awsRegion ? (Aws::String) awsRegion : defaultRegion;
	/* AWS S3 derives a default endpoint from the region, but for GCS or
	 * Ali OSS the endpoint must be supplied from external configuration. */
	if (endpoint)
		clientConfig.endpointOverride = (Aws::String) endpoint;
	s3_client = new Aws::S3::S3Client(cred, clientConfig);
}
```

@vitabaks

+1 to support GCS.

@vitabaks commented Nov 24, 2023

AWS S3 (installed according to the instructions):

postgres=# CREATE SERVER parquet_s3_srv FOREIGN DATA WRAPPER parquet_s3_fdw OPTIONS (region 'us-west1');
CREATE SERVER
postgres=# CREATE USER MAPPING FOR public SERVER parquet_s3_srv OPTIONS (user '*******', password '********');
CREATE USER MAPPING
postgres=# CREATE FOREIGN TABLE test_table (
    c1 INT2 OPTIONS (key 'true'),
    c2 TEXT,
    c3 BOOLEAN
) SERVER parquet_s3_srv OPTIONS (filename 's3://vitabaks/parquet-test/test_table.parquet');
CREATE FOREIGN TABLE
postgres=# INSERT INTO test_table VALUES (1, 'text1', true), (2, DEFAULT, false), ((select 3), (select i from (values('values are fun!')) as foo (i)), true);
ERROR:  parquet_s3_fdw: failed to open Parquet file HeadObject failed
postgres=# 

Access to this bucket (using the same key and secret specified in the USER MAPPING) works, as confirmed with the aws-cli:

root@vitaliy-parquet-s3-fdw-pgnode01:~# aws s3 ls vitabaks
                           PRE parquet-test/
                           PRE walg/
root@vitaliy-parquet-s3-fdw-pgnode01:~# aws s3 ls vitabaks/parquet-test/
2023-11-24 19:17:36          0 

Does it look like there is a problem with the extension or its dependencies?

@son-phamngoc

Hello, thanks for your report.

Judging by the behavior, I think the problem is related to proxy settings: I can reproduce it when using parquet_s3_fdw behind a proxy. To confirm: are you working behind a proxy?
To connect to AWS S3 through a proxy, the proxy settings of ClientConfiguration must be set: proxyScheme, proxyHost, proxyPort, proxyUserName and proxyPassword, as in the sketch below.
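A minimal sketch of those ClientConfiguration fields; all values are placeholders to be replaced with your environment's:

```cpp
#include <aws/core/Aws.h>
#include <aws/core/client/ClientConfiguration.h>

// Sketch only: placeholder proxy values; replace with your own settings.
static Aws::Client::ClientConfiguration makeProxyConfig()
{
    Aws::Client::ClientConfiguration clientConfig;
    clientConfig.proxyScheme   = Aws::Http::Scheme::HTTP;
    clientConfig.proxyHost     = "proxy.example.com";
    clientConfig.proxyPort     = 8080;
    clientConfig.proxyUserName = "proxy_user";
    clientConfig.proxyPassword = "proxy_password";
    return clientConfig;
}
```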

The current implementation of the use_minio, region and endpoint options is not good: it only lets you choose between region and endpoint based on the value of use_minio, while the AWS SDK can actually use both to connect. They should be updated.

Based on your comments and the situation above, I attached a patch file that temporarily changes the implementation so that you can use region, endpoint, or both to connect:
parquet_s3_fdw_issue_18.patch
If you are working behind a proxy, please update the proxy values in the patch file before applying it; if you are not, please remove the proxy settings from it.
However, because parquet_s3_fdw has never been tested with GCS or Ali OSS, I'm not sure it will work well with them.
Could you try it and let me know the result?

If this patch solves your problem, we will include it in the next release.
If your problem is different, please give us more detailed information on how to reproduce it.

@CHEN-Jing194 commented May 21, 2024

I have the same problem. Has this issue been resolved? I'm not using any proxies.

@son-phamngoc

@CHEN-Jing194 Thanks for your report.

The root cause is not clear, so the issue has not been resolved.
Possibly the hard-coded scheme (HTTP) is the root cause.
Could you help me by trying this patch file? It adds a new scheme option for SERVER. Please specify scheme http or https when creating the SERVER and try again.
Thank you for your support.

parquet_s3_fdw_issue_18_2.patch
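For example, with the patch applied (the endpoint value here is a placeholder):

```sql
-- Sketch assuming the patched 'scheme' server option is available.
CREATE SERVER parquet_s3_srv FOREIGN DATA WRAPPER parquet_s3_fdw
    OPTIONS (endpoint 'storage.example.com', scheme 'https');
```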

@CHEN-Jing194

> The root cause is not clear, so the issue has not been resolved. Possibly the hard-coded scheme (HTTP) is the root cause. Could you help me by trying this patch file? […] parquet_s3_fdw_issue_18_2.patch

I found that there is no corresponding parquet file on my S3. When I create a foreign table, no new file is created on S3. So what should I do to be able to INSERT after creating the table?

@son-phamngoc

> I found that there is no corresponding parquet file on my S3. When I create a foreign table, no new file is created on S3. So what should I do to be able to INSERT after creating the table?

@CHEN-Jing194
Please create a foreign table that specifies a key column. When you execute an INSERT query, a new parquet file will be created automatically on S3.
Example:

```sql
CREATE FOREIGN TABLE example_insert (
    c1 INT2 OPTIONS (key 'true'),
    c2 TEXT,
    c3 BOOLEAN
) SERVER parquet_s3_srv OPTIONS (filename 's3://data/example_insert.parquet');

INSERT INTO example_insert VALUES (1, 'text1', true), (2, DEFAULT, false);
```

And please let me know if you face the original problem of this issue.

@CHEN-Jing194

> @CHEN-Jing194 Please create a foreign table that specifies a key column. When you execute an INSERT query, a new parquet file will be created automatically on S3. […]

[screenshot]
This issue still exists. 🥹

@son-phamngoc commented Jun 25, 2024

@CHEN-Jing194, sorry for the unclear explanation.
My example above was not correct: it applies when the parquet file already exists. Because the filename option is specified, parquet_s3_fdw looks for that exact file on S3, and the error occurs.

If no parquet file exists on S3, you need to use the dirname option of the FOREIGN TABLE.
There are two ways to use dirname to generate a parquet file automatically:

  • Specify the dirname option: a new file named [dirname]/[table_name]-[current_time].parquet is generated automatically.

contrib_regression=# CREATE FOREIGN TABLE example_insert (c1 INT2 OPTIONS (key 'true'), c2 text, c3 boolean) SERVER parquet_s3_srv OPTIONS (dirname 's3://parquets3fdw');
CREATE FOREIGN TABLE
contrib_regression=# INSERT INTO example_insert VALUES (1, 'text1', true), (2, DEFAULT, false);
INSERT 0 2

You can see the generated file on S3:

[user1@localhost parquet_s3_fdw]$ aws s3 ls parquets3fdw
2024-06-25 10:47:23        713 example_insert-2024-06-25 03:47:22.572209.parquet
  • Specify the insert_file_selector option: insert_file_selector lets the user define the rule that generates the file name; a new file following that rule is created. For example, to make the generated file have the format [dirname]/new_file.parquet:

contrib_regression=# CREATE FUNCTION selector(dirname text)
RETURNS TEXT AS
$$
    SELECT dirname || '/new_file.parquet';
$$
LANGUAGE SQL;

contrib_regression=# CREATE FOREIGN TABLE example_insert (c1 INT2 OPTIONS (key 'true'), c2 text, c3 boolean) SERVER parquet_s3_srv OPTIONS (dirname 's3://parquets3fdw', insert_file_selector 'selector(dirname)');
CREATE FOREIGN TABLE

contrib_regression=# INSERT INTO example_insert VALUES (1, 'text1', true), (2, DEFAULT, false);
INSERT 0 2

You can see the generated file on S3:

[user1@localhost parquet_s3_fdw]$ aws s3 ls parquets3fdw
2024-06-25 11:05:13        713 new_file.parquet

Please let me know if there is any problem.

@CHEN-Jing194

> @CHEN-Jing194, sorry for the unclear explanation. My example above was not correct. […] If no parquet file exists on S3, you need to use the dirname option of the FOREIGN TABLE. […]

But how do I set the Access Key ID? I only see options to set a username and password.
[screenshot]

@son-phamngoc

> But how do I set the Access Key ID? I only see options to set a username and password.

You can set user to the Access Key ID and password to the Secret Access Key when creating the USER MAPPING:

CREATE USER MAPPING FOR public SERVER parquet_s3_srv OPTIONS (user 'Access Key value', password 'Secret access key value');

@CHEN-Jing194

> You can set user to the Access Key ID and password to the Secret Access Key when creating the USER MAPPING. […]

[screenshot]
Now the error is different. It seems like the file still hasn't been created on S3.

@son-phamngoc

Could you share the SQL you used to create the SERVER and the FOREIGN TABLE example_insert?

@CHEN-Jing194

> Could you share the SQL you used to create the SERVER and the FOREIGN TABLE example_insert?

Of course. I went through the process again:
[screenshot]

@CHEN-Jing194

> Of course. I went through the process again: [screenshot]

Maybe something is wrong with the region the S3 bucket was created in?

@CHEN-Jing194

> Maybe something is wrong with the region the S3 bucket was created in?

[screenshot]
Failed. 🤧

@son-phamngoc

The SERVER must be created with the region or endpoint option; if neither is given, parquet_s3_fdw uses the default values (ap-northeast-1 and 127.0.0.1:9000).
region and endpoint are SERVER options; you can't use them on a FOREIGN TABLE.
Please try again:

CREATE SERVER parquet_s3_srv FOREIGN DATA WRAPPER parquet_s3_fdw OPTIONS (region 'us-east-1');

@CHEN-Jing194

> The SERVER must be created with the region or endpoint option […]

I specified the region, but there is still a problem.
[screenshot]

@CHEN-Jing194

> I specified the region, but there is still a problem. [screenshot]

However, I can still use pg_tier normally, which is a bit strange.

@CHEN-Jing194

@son-phamngoc I'm so sorry, I eventually found out that I had installed an older version of the extension. 😭

@son-phamngoc

> @son-phamngoc I'm so sorry, I eventually found out that I had installed an older version of the extension. 😭

@CHEN-Jing194 No problem.
Do you have any other issues with this FDW?

@CHEN-Jing194

> @CHEN-Jing194 No problem. Do you have any other issues with this FDW?

Can foreign tables be partitioned? This should help reduce the amount of data scanned and lower costs.

@son-phamngoc

@CHEN-Jing194 Sorry for the late response.

Could you confirm which of the following interpretations matches your expectation?

  1. Use multiple foreign tables as partitions of a partitioned table.
    [diagram]
    parquet_s3_fdw supports this; you can build a similar model.
    The data must be split into multiple parquet files, and a foreign table must be created for each of them.
    When querying, the partitioned table scans only the matching foreign tables rather than all of them.
    You can refer to the test file partition.sql for an example, and see the sketch after this list.

  2. Use only one foreign table, and depending on the query condition, scan only the corresponding part of the parquet file instead of the whole file.
    [diagram]
    This interpretation matches the row-group/"chunk" structure of the parquet format: https://parquet.apache.org/docs/file-format/
    If this is your expectation, parquet_s3_fdw currently does not support it.
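A minimal sketch of interpretation 1 (bucket path, table and column names are placeholders; PostgreSQL has allowed foreign tables as partitions since version 11):

```sql
CREATE TABLE measurements (id int, logdate date, note text)
    PARTITION BY RANGE (logdate);

-- Each partition is a foreign table backed by its own parquet file.
CREATE FOREIGN TABLE measurements_2023
    PARTITION OF measurements FOR VALUES FROM ('2023-01-01') TO ('2024-01-01')
    SERVER parquet_s3_srv OPTIONS (filename 's3://example-bucket/2023.parquet');

CREATE FOREIGN TABLE measurements_2024
    PARTITION OF measurements FOR VALUES FROM ('2024-01-01') TO ('2025-01-01')
    SERVER parquet_s3_srv OPTIONS (filename 's3://example-bucket/2024.parquet');

-- Partition pruning scans only measurements_2023 here:
SELECT * FROM measurements WHERE logdate = '2023-06-01';
```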

@CHEN-Jing194

@son-phamngoc
Hello, I found the partition.sql file and did some testing following its examples, and I noticed some strange behavior. My query is only correct when the WHERE condition includes the partition key and touches a single partition. Otherwise the returned records are duplicated, as shown in the screenshot below.
[screenshot]

Additionally:

  1. I cannot delete data.
    [screenshot]

  2. When inserting data, I have to target a specific foreign partition table; I cannot insert directly into the parent table.
    [screenshot]

@son-phamngoc

@CHEN-Jing194
Could you send me the SQL you used to create the partitioned table and the foreign tables?
Are your example_part1 and example_part2 foreign tables referring to the same parquet file?
If so, the data will be duplicated when both foreign tables are scanned.

> I cannot delete data.

UPDATE and DELETE work only when key columns are specified.
Please alter the foreign table to mark the key column with the key option.
[screenshot]
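For example (a sketch; the table and column names are placeholders):

```sql
-- Mark c1 as a key column on an existing foreign table.
ALTER FOREIGN TABLE example_part1 ALTER COLUMN c1 OPTIONS (ADD key 'true');
```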

> When inserting data, I have to target a specific foreign partition table; I cannot insert directly into the parent table.

Thank you for reporting this; I can reproduce the problem.
It is possibly a bug in parquet_s3_fdw. We will fix it in a future release.

@CHEN-Jing194

@son-phamngoc
[screenshots]

@son-phamngoc

@CHEN-Jing194
Your two foreign tables refer to the same dirname.
When you executed the first INSERT on the example_part2 foreign table, a new parquet file was created with the name format [dirname]/[table_name]-[current_time].parquet.
When you then executed the second INSERT on example_part1, parquet_s3_fdw looked for a parquet file under that dirname whose schema matches the foreign table's schema. Because example_part1 and example_part2 have the same schema and the same dirname, the data was inserted into the same parquet file.
You can check on S3: only one parquet file was created.
Therefore, when you select data, example_part1 and example_part2 both read the same parquet file, so the data is duplicated.
If you set different dirname values for the two foreign tables, two separate parquet files will be generated and you will not get duplicated data, e.g.:
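A sketch with distinct dirname values (the bucket paths are placeholders):

```sql
CREATE FOREIGN TABLE example_part1 (c1 INT2 OPTIONS (key 'true'), c2 text, c3 boolean)
    SERVER parquet_s3_srv OPTIONS (dirname 's3://parquets3fdw/part1');

CREATE FOREIGN TABLE example_part2 (c1 INT2 OPTIONS (key 'true'), c2 text, c3 boolean)
    SERVER parquet_s3_srv OPTIONS (dirname 's3://parquets3fdw/part2');
```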

@CHEN-Jing194

@son-phamngoc
Thank you for patiently answering my questions. Yes, that's correct: it worked when I pointed the two tables at different folders. Everything runs fine except for the insert issue mentioned above. Another question: why must key column values be unique? Does each key column's values have to be unique, or only the combination of values across multiple key columns? What problems arise if I don't ensure this uniqueness?

@son-phamngoc

@CHEN-Jing194

> Thank you for patiently answering my questions.

No problem. You are welcome.

> Another question: why must key column values be unique? Does each key column's values have to be unique, or only the combination of values across multiple key columns? What problems arise if I don't ensure this uniqueness?

parquet_s3_fdw uses the key column values to find the correct target record for an update or delete.
If you use only one key column, the values in that column should be unique.
If you use multiple key columns, the values in a single key column may repeat, but the combination of key column values should be unique.
If you do not ensure uniqueness, parquet_s3_fdw can find the wrong target and then update or delete the wrong records.
For example: I have a parquet file and a foreign table t1 with three columns: id (int) as the key column, c1 (int), and c2 (text). The data is as follows:

| id | c1  | c2  |
|----|-----|-----|
| 1  | 200 | aaa |
| 1  | 100 | bbb |

I want to update all records with c1 = 100, so the query is: UPDATE t1 SET c2 = 'ccc' WHERE c1 = 100.
First, parquet_s3_fdw reads all records from the parquet file into a cache and iterates over them.
The first record (1, 200, 'aaa') does not match the condition c1 = 100, so it is skipped.
The second record (1, 100, 'bbb') matches c1 = 100, so its key value is retrieved (id = 1).
Next, parquet_s3_fdw uses the key value id = 1 to search the cache for the record to update.
It finds the first record in the cache that matches id = 1 and updates it.
However, that record has c1 = 200, so it is not the correct target.
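When no single column can be unique, a composite key can be declared by marking several columns; a sketch (the names and path are placeholders):

```sql
CREATE FOREIGN TABLE t2 (
    id  int OPTIONS (key 'true'),  -- first part of the composite key
    seq int OPTIONS (key 'true'),  -- second part: (id, seq) should be unique
    c2  text
) SERVER parquet_s3_srv OPTIONS (filename 's3://example-bucket/t2.parquet');
```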

@CHEN-Jing194

@son-phamngoc
I don't have any more questions. Now I feel like I can use this extension correctly. Thanks, guys!

@son-phamngoc commented Jul 18, 2024

@CHEN-Jing194 I'm glad to hear that.
I have a question: did you use my patch from #18 (comment)?
Did you hit the connection issue when not using the patch?

@CHEN-Jing194

@son-phamngoc
No, I did not use that patch. The initial issue I encountered was simply because there was no corresponding file on S3.

@son-phamngoc

@CHEN-Jing194 Thank you for your answer.

@CMBCKSRL @mausch @ZhiXingHeYiApple @vitabaks
Would you try my patch file at #18 (comment) and let me know whether it fixes your issue?
Your feedback helps us improve the quality of this FDW.
Thank you for your support.

@CHEN-Jing194

@son-phamngoc Hello, I have another question. Does the FDW download the corresponding parquet file locally and then perform the SQL operations on it? I tested it and the traffic is quite high. I originally thought it used S3 Select functionality, but it seems it does not.
