feat: multiple collection support in source/destination connector #97

grvsahil · 2024-10-08T15:28:58Z

Description

This PR includes the following changes:

Multiple collection/table support.
Updated record metadata field that stores/fetches a Redshift table name from redshift.table to opencdc.collection for compatibility with other connectors.
Updated README to include changes for multiple collections.

Fixes # (issue)

Quick checks:

There is no other pull request for the same update/change.
I have written unit tests.
I have made sure that the PR is of reasonable size and can be easily reviewed.

README.md

hariso · 2024-10-17T08:43:02Z

acceptance_test.go

@@ -36,7 +36,7 @@ const (
 	// envNameDSN is a Redshift dsn environment name.
 	envNameDSN = "REDSHIFT_DSN"
 	// metadataFieldTable is a name of a record metadata field that stores a Redshift table name.
-	metadataFieldTable = "redshift.table"
+	metadataFieldTable = "opencdc.collection"


Nitpick: we can use opencdc.MetadataCollection from Conduit Commons.

hariso · 2024-10-17T08:49:05Z

common/error.go

@@ -67,3 +67,26 @@ func (e GreaterThanError) Error() string {
 func NewGreaterThanError(fieldName string, value int) GreaterThanError {
 	return GreaterThanError{fieldName: fieldName, value: value}
 }
+
+type NoTablesOrColumnsError struct{}


Maybe it's simpler to just use errors.New() for this one?

hariso · 2024-10-17T08:49:58Z

destination/destination_test.go

 	is.True(err != nil)
 	is.Equal(err.Error(),
-		`config invalid: error validating "table": required parameter is not provided`)
+		`config invalid: error validating "dsn": required parameter is not provided`)


We should still validate if the tables or table parameter is present.

hariso · 2024-10-17T09:20:25Z

common/error.go

+}
+
+func NewMismatchedTablesAndColumnsError(tablesCount, orderingColumnsCount int) MismatchedTablesAndColumnsError {
+	return MismatchedTablesAndColumnsError{tablesCount: tablesCount, orderingColumnsCount: orderingColumnsCount}


IMHO, we should prefer using fmt.Errorf() over creating new error types. The fields in the error structs (e.g. tablesCount and orderingColumnsCount) are only ever accessed in the Error() function. If we had some logic for those fields, then I think it might make sense having an error type.
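As a rough illustration of the `fmt.Errorf()` approach, the mismatch check could look like the sketch below. The helper name `validateCounts` and the message wording are assumptions made for this example.

```go
package main

import "fmt"

// validateCounts is a hypothetical helper: it reports a mismatch between the
// number of tables and ordering columns without a dedicated error type.
func validateCounts(tables, orderingColumns []string) error {
	if len(tables) != len(orderingColumns) {
		return fmt.Errorf(
			"mismatched tables and ordering columns: %d tables, %d ordering columns",
			len(tables), len(orderingColumns),
		)
	}
	return nil
}

func main() {
	fmt.Println(validateCounts([]string{"users", "orders"}, []string{"id"}))
}
```

The counts only appear in the message, so no struct fields are needed to carry them.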

hariso · 2024-10-17T09:35:07Z

source/config/config.go

+	Tables []string `json:"tables" validate:"required"`
+	// OrderingColumns is a list of corresponding ordering columns for the table
+	// that the connector will use for ordering rows.
+	OrderingColumns []string `json:"orderingColumns" validate:"required"`


This looks good, but there's something in our tool, paramgen, that can simplify the code and make it a bit safer. It has "dynamic configuration parameters", where it basically allows you to have a map like here: https://github.com/ConduitIO/conduit-connector-generator/blob/main/config.go#L50.

This makes it possible to have a config like this: https://github.com/ConduitIO/conduit-connector-generator?tab=readme-ov-file#collections.

In our case that would be something like:

orderingColumns.tableNameFoo: columnFoo
orderingColumns.tableNameBar: columnBar

This is more readable (it's clear which column goes to which table).

Okay, we can also format it like:

tables.tableNameFoo.orderingColumn: columnFoo
tables.tableNameBar.orderingColumn: columnBar

Does this look okay?

Looks good to me!
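To make the agreed key format concrete, here is a hand-rolled sketch that extracts a table-to-ordering-column map from a flat config. It does not use paramgen (which would generate this handling from a map field), and the function name `orderingColumnsByTable` is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	keyPrefix = "tables."
	keySuffix = ".orderingColumn"
)

// orderingColumnsByTable collects "tables.<table>.orderingColumn" entries
// from a flat config map into a table -> ordering column map.
func orderingColumnsByTable(cfg map[string]string) (map[string]string, error) {
	result := make(map[string]string)
	for key, column := range cfg {
		if !strings.HasPrefix(key, keyPrefix) || !strings.HasSuffix(key, keySuffix) {
			continue // unrelated parameter, e.g. "dsn"
		}
		if len(key) <= len(keyPrefix)+len(keySuffix) {
			return nil, fmt.Errorf("config key %q has no table name", key)
		}
		if column == "" {
			return nil, fmt.Errorf("config key %q has no ordering column", key)
		}
		// Slice instead of splitting on "." so that table names
		// containing dots stay intact.
		table := key[len(keyPrefix) : len(key)-len(keySuffix)]
		result[table] = column
	}
	if len(result) == 0 {
		return nil, fmt.Errorf("no %s<table>%s entries found", keyPrefix, keySuffix)
	}
	return result, nil
}

func main() {
	columns, err := orderingColumnsByTable(map[string]string{
		"dsn":                                "example",
		"tables.tableNameFoo.orderingColumn": "columnFoo",
		"tables.tableNameBar.orderingColumn": "columnBar",
	})
	fmt.Println(columns, err)
}
```

Because each column is keyed by its table, the two lists can no longer drift out of sync, which is the safety benefit mentioned above.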

hariso · 2024-10-17T11:19:36Z

source/iterator/iterator.go

-		} else {
-			// set the LastProcessedValue to skip a snapshot of the entire table
-			iterator.position.LastProcessedValue = latestSnapshotValue
+		worker, err := NewWorker(ctx, WorkerConfig{


It would be good to start each worker in a separate goroutine, so that work can be parallelized (we're doing something similar in the Postgres connector). It's also closer to the current behavior: to read data from multiple tables, multiple source connectors are needed (one table per source connector), and source connectors work in parallel. We can tackle that in a separate PR.

Yeah right, I'll implement it in a separate PR. The source connector will spawn workers (one for each table) in separate goroutines and read records using a common channel.

Sounds good to me! Feel free to open an issue for that so we can track it. :)

destination/config/config_test.go

hariso · 2024-10-17T11:33:27Z

source/iterator/iterator.go

+		if !ok {
+			position = TablePosition{
+				LastProcessedValue:  nil,
+				LatestSnapshotValue: nil,
+			}
 		}


I don't think this is needed, because you'll get the zero value for the TablePosition struct, which is an "empty" struct and has all zero values in fields too.
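The zero-value behavior can be checked with a small sketch. The field types are assumed to be `any` here because the diff assigns `nil` to them. The `lookup` helper is hypothetical.

```go
package main

import "fmt"

// TablePosition mirrors the struct from the diff; the `any` field types are
// an assumption for this sketch.
type TablePosition struct {
	LastProcessedValue  any
	LatestSnapshotValue any
}

// lookup returns the position stored for a table. For a missing key the map
// index expression already yields the zero value, so no !ok branch is needed.
func lookup(positions map[string]TablePosition, table string) TablePosition {
	return positions[table]
}

func main() {
	position := lookup(map[string]TablePosition{}, "users")
	fmt.Println(position.LastProcessedValue == nil, position.LatestSnapshotValue == nil)
}
```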

hariso · 2024-10-17T11:36:18Z

source/iterator/worker.go

+}
+
+// HasNext returns a bool indicating whether the source has the next record to return or not.
+func (worker *Worker) HasNext(ctx context.Context) (bool, error) {


Nitpick: it's common to name the receiver with one letter, e.g. w in this case.

hariso

Good work! I left a few comments and questions. I believe we can merge this PR pretty soon. :)

hariso · 2024-10-24T13:29:53Z

destination/config/config.go

 // Config is a destination configuration needed to connect to Redshift database.
 type Config struct {
 	common.Configuration
+	// Table is the configuration of the table name.
+	Table string `json:"table" default:"{{ index .Metadata \"opencdc.collection\" }}"`
+	// KeyColumns is the configuration of comma-separated column names to build the sdk.Record.Key.


Nitpick: sdk.Record.Key -> "record key", since I think it's pretty clear what that is about (plus, the sdk.Record type is gone, it's opencdc.Record).

hariso · 2024-10-24T13:43:56Z

source/config/config.go

+	// c.BatchSize handling "gte=1" and "lte=100000" validations.
+	if c.BatchSize < common.MinConfigBatchSize {
+		return common.NewGreaterThanError(ConfigBatchSize, common.MinConfigBatchSize)
+	}
+	if c.BatchSize > common.MaxConfigBatchSize {
+		return common.NewLessThanError(ConfigBatchSize, common.MaxConfigBatchSize)
+	}
+


Our paramgen tool supports validating min and max for a value: https://github.com/ConduitIO/conduit-commons/tree/main/paramgen#parameter-tags (see no. 3).
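A hedged sketch of what the tag-based variant might look like. It assumes paramgen's `gt`/`lt` validations are exclusive bounds, so `gt=0,lt=100001` would correspond to the "gte=1" and "lte=100000" checks mentioned in the diff comment. The exact tag spelling should be confirmed against the linked README. The `validateTag` helper is hypothetical and only reads the tag back via reflection.

```go
package main

import (
	"fmt"
	"reflect"
)

// Config is a hypothetical excerpt of the source config. The validate tag
// is an assumption about paramgen's syntax, not copied from the PR.
type Config struct {
	// BatchSize is the size of a rows batch.
	BatchSize int `json:"batchSize" validate:"gt=0,lt=100001"`
}

// validateTag returns the validate tag declared on BatchSize.
func validateTag() string {
	field, _ := reflect.TypeOf(Config{}).FieldByName("BatchSize")
	return field.Tag.Get("validate")
}

func main() {
	fmt.Println(validateTag())
}
```

With the bounds declared in the tag, the manual `if` checks in `Validate` would presumably become unnecessary.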

hariso · 2024-10-24T13:44:53Z

source/iterator/iterator.go

-		} else {
-			// set the LastProcessedValue to skip a snapshot of the entire table
-			iterator.position.LastProcessedValue = latestSnapshotValue
+		worker, err := NewWorker(ctx, WorkerConfig{


Sounds good to me! Feel free to open an issue for that so we can track it. :)

hariso · 2024-10-24T13:46:31Z

source/iterator/position_test.go

-			name: "success_position_is_nil",
-			in:   Position{},
-			want: opencdc.Position(`{"lastProcessedValue":null,"latestSnapshotValue":null}`),
-		},
-		{
-			name: "success_integer_fields",
-			in: Position{
-				LastProcessedValue:  10,
-				LatestSnapshotValue: 30,
-			},
-			want: opencdc.Position(`{"lastProcessedValue":10,"latestSnapshotValue":30}`),
-		},
-		{
-			name: "success_string_fields",
-			in: Position{
-				LastProcessedValue:  "abc",
-				LatestSnapshotValue: "def",
-			},
-			want: opencdc.Position(`{"lastProcessedValue":"abc","latestSnapshotValue":"def"}`),


Is there a special reason why these tests are not valid anymore?

Some of these are actually still valid, but they should be done for individual tables. I'm adding them back against the table position.
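For illustration, the removed cases could be carried over per table roughly as below. This sketch assumes positions are kept in a map keyed by table name and reuses the JSON field names from the removed tests. The actual layout of the new `Position` type is not shown in this thread, and `marshalPositions` is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TablePosition uses the JSON field names from the removed tests; the `any`
// field types are an assumption.
type TablePosition struct {
	LastProcessedValue  any `json:"lastProcessedValue"`
	LatestSnapshotValue any `json:"latestSnapshotValue"`
}

// marshalPositions serializes per-table positions keyed by table name.
func marshalPositions(positions map[string]TablePosition) (string, error) {
	raw, err := json.Marshal(positions)
	if err != nil {
		return "", fmt.Errorf("marshal positions: %w", err)
	}
	return string(raw), nil
}

func main() {
	cases := []struct {
		name string
		in   map[string]TablePosition
		want string
	}{
		{
			name: "success_position_is_nil",
			in:   map[string]TablePosition{"users": {}},
			want: `{"users":{"lastProcessedValue":null,"latestSnapshotValue":null}}`,
		},
		{
			name: "success_integer_fields",
			in:   map[string]TablePosition{"users": {LastProcessedValue: 10, LatestSnapshotValue: 30}},
			want: `{"users":{"lastProcessedValue":10,"latestSnapshotValue":30}}`,
		},
	}
	for _, tc := range cases {
		got, err := marshalPositions(tc.in)
		fmt.Printf("%s: %t\n", tc.name, err == nil && got == tc.want)
	}
}
```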

hariso · 2024-10-24T13:53:18Z

destination/writer/writer.go

-	tableName, ok := metadata[metadataFieldTable]
-	if !ok {
-		return w.table
+func (w *Writer) preparePayloadAndKey(record opencdc.Record) (map[string]interface{}, map[string]interface{}, error) {


Nitpick: maybe return opencdc.StructuredData instead of the map?

hariso · 2024-10-24T13:55:42Z

destination/writer/writer.go

@@ -94,15 +107,20 @@ func (w *Writer) Insert(ctx context.Context, record opencdc.Record) error {
 		return ErrNoPayload
 	}

-	payload, err = columntypes.ConvertStructuredData(w.columnTypes, payload)
+	columnTypes, err := w.getColumnTypes(ctx, table)


IIUC, this executes a query (to get the column types) for each record?

It reuses the column types if they have already been fetched and stored for the record's table. The query to get the column types is executed only when a record arrives from a new table whose column types aren't stored yet.

hariso · 2024-10-24T14:07:22Z

source/iterator/worker.go

+	columnTypes map[string]string
+
+	// iterator is an instance of the iterator
+	iterator *Iterator


Hmm, what's Worker supposed to do, and what is Iterator supposed to do? IIUC, Iterator is a "global" iterator, for the whole source connector, and Worker is actually an iterator for a single table?

hariso · 2024-10-24T14:10:39Z

source/iterator/worker.go

+	if err != nil {
+		return fmt.Errorf("close db rows: %w", err)
+	}
+


Nitpick: this can go in the body of the if statement, i.e.:

if w.rows != nil {
	err := w.rows.Close()
	if err != nil {
		return fmt.Errorf()
	}
}

It's the same result, but the scope of err is smaller.

Gaurav Sahil added 3 commits October 8, 2024 20:47

feat: multiple collection support in source/destination connector

ab80ccb

updated source paramgen

422aa47

fix: resolved conflicts and added config test cases

853ee18

raulb assigned grvsahil Oct 15, 2024

hariso self-requested a review October 17, 2024 08:59

hariso reviewed Oct 17, 2024

parikshitg mentioned this pull request Oct 17, 2024

Elastic Search Source conduitio-labs/conduit-connector-elasticsearch#95

Gaurav Sahil added 2 commits October 21, 2024 17:32

fix: pr comments

b86475d

fix: merged sdk updates

20c6458

grvsahil marked this pull request as ready for review October 21, 2024 12:14

grvsahil requested a review from a team as a code owner October 21, 2024 12:14

hariso reviewed Oct 24, 2024
