
Failed to read partial columns from Kafka source with data format CSV #817

Open
yl-lisen opened this issue Aug 12, 2024 · 2 comments
Labels: bug (Something isn't working), community (Feedback from community)

Comments

@yl-lisen (Collaborator)

Describe what's wrong

How to reproduce

CREATE EXTERNAL STREAM account
(
  `id` int,
  `name` string
)
SETTINGS type = 'kafka', topic = 'topic_account', brokers = 'stream-store:9092', data_format = 'CSV';

--- Ingest Kafka message: "1, 1"


--- Read all columns is ok
timeplusd :) select * from account;

SELECT
  *
FROM
  account

Query id: a15b8db7-2af1-4d4c-9a9c-8b074518935d

┌─id─┬─name─┐
│  1 │  1   │
└────┴──────┘


--- Read partial column but no output.
select id from account;

SELECT
  id
FROM
  account

Query id: 2d032487-becb-4e26-b773-7733dd5521e7

↗ Progress: 0.00 rows, 0.00 B (0.00 rows/s., 0.00 B/s.) ^C
Cancelling query.

There is an error log:

2024.08.12 20:41:51.234635 [ 2779177 ] {af315de4-b892-455b-949f-0c013580efcd} <Error> account.rdkafka#consumer-6: Failed to parse message at 2: Expected end of line: (at row 1)
: 
Row 1:
Column 0,   name: id, type: int32, parsed text: "1"
ERROR: There is no line feed. "1" found instead.
 It's like your file has more columns than expected.
And if your file has the right number of columns, maybe it has an unquoted string value with a comma.
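The failure mode can be sketched outside timeplusd: a CSV reader initialized with only the selected column's schema hits the field delimiter where it expects end-of-line. A minimal illustration in plain Python (the helper name is hypothetical, not the engine's actual code):

```python
import csv
import io

def parse_with_schema(line: str, num_expected_columns: int):
    """Parse one CSV line, failing if the field count does not match
    the schema the reader was initialized with (mimicking the strict
    check behind the 'Expected end of line' error above)."""
    fields = next(csv.reader(io.StringIO(line)))
    if len(fields) != num_expected_columns:
        raise ValueError(
            f"Expected end of line after column {num_expected_columns - 1}: "
            f"got {len(fields)} fields in {line!r}"
        )
    return fields

# Reading all columns works: the message "1, 1" has two fields.
print(parse_with_schema("1, 1", 2))  # ['1', ' 1']

# Reading only `id` fails: the reader was built with a one-column
# schema, so the comma after "1" is unexpected.
try:
    parse_with_schema("1, 1", 1)
except ValueError as e:
    print(e)
```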


Error message and/or stacktrace

Additional context

@yl-lisen yl-lisen added the bug Something isn't working label Aug 12, 2024
@jovezhong jovezhong added the community Feedback from community label Aug 12, 2024
@zliang-min (Collaborator) commented Aug 12, 2024

This is because it uses the columns from the SELECT to create the InputFormat. Formats that carry schema information, such as Protobuf and Avro, know how to extract only the wanted columns.

To make formats like CSV work, we will need to either initialize the format with the full table schema (which means it will generate the full chunk of columns no matter how many columns the query selects, so it will be less efficient, but it is the simpler change), or refactor such formats so they can skip reading some fields (they will need the column mapping to figure out which fields to skip; the format will still parse the data, but it will not need to generate the unneeded columns).
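As an illustration of the first (simpler) option, here is a sketch in Python; the names are hypothetical and the real change would live in the engine's C++ InputFormat layer. Every row is parsed with the full table schema, and only the columns requested by the query are projected afterward:

```python
import csv
import io

FULL_SCHEMA = ["id", "name"]  # full table schema of `account`

def read_partial(csv_text: str, wanted: list[str]):
    """Parse rows with the full table schema, then keep only the
    wanted columns. Less efficient than skipping fields during
    parsing, since every column is still materialized first."""
    indices = [FULL_SCHEMA.index(col) for col in wanted]
    for row in csv.reader(io.StringIO(csv_text)):
        yield [row[i] for i in indices]

# SELECT id FROM account over the message "1, 1"
print(list(read_partial("1, 1\n", ["id"])))  # [['1']]
```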

@yl-lisen (Collaborator, Author)
Agreed. We can make this scenario work first, then do the partial-parsing optimization.
