
RFS now uses reactor-netty for bulk indexing #607

Merged: 5 commits into opensearch-project:main from MIGRATIONS-1600-2, Apr 24, 2024

Conversation

@chelma (Member) commented Apr 23, 2024

Description

  • Updated RFS to use the reactor-netty library to perform asynchronous HTTP operations
  • Updated RFS to perform indexing against the target cluster with bulk operations (a minimal sketch of the general shape follows this list)
  • Added logging to capture when there's a wacky Lucene document we can't parse correctly (such as one missing an _id)
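
For reference, here's a minimal sketch of what the first two bullets amount to with reactor-netty: an asynchronous POST of a newline-delimited body to the _bulk endpoint. The class and method names are illustrative only, not the PR's actual code.

import reactor.core.publisher.Mono;
import reactor.netty.ByteBufFlux;
import reactor.netty.http.client.HttpClient;

public class BulkPostSketch {
    // Hypothetical helper: POSTs an already-assembled, newline-delimited
    // _bulk body and resolves with the response body as a String.
    public static Mono<String> postBulk(String baseUrl, String indexName, String bulkBody) {
        HttpClient client = HttpClient.create().baseUrl(baseUrl);
        return client
                .headers(h -> h.add("Content-Type", "application/json"))
                .post()
                .uri("/" + indexName + "/_bulk")
                .send(ByteBufFlux.fromString(Mono.just(bulkBody)))
                .responseContent() // stream the response bytes
                .aggregate()       // gather them into a single buffer
                .asString();       // decode as a String
    }
}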

Issues Resolved

Testing

  • Added unit tests for the changes to the ConnectionDetails class
  • Manually tested the changes to the reindexing behavior. Example output below:
16:31:59.167 INFO  Blob files unpacked successfully
16:31:59.167 INFO  ==================================================================
16:31:59.167 INFO  Reindexing the documents...
16:31:59.171 INFO  === Index Id: logs-241998, Shard ID: 0 ===
16:31:59.586 INFO  199 documents found in the current Lucene index
16:31:59.627 INFO  199 documents in current bulk request
16:31:59.999 INFO  Shard reindexing completed
16:31:59.999 INFO  === Index Id: logs-241998, Shard ID: 1 ===
16:32:00.008 INFO  203 documents found in the current Lucene index
16:32:00.028 INFO  203 documents in current bulk request
16:32:00.030 INFO  Shard reindexing completed
16:32:00.030 INFO  === Index Id: logs-241998, Shard ID: 2 ===
16:32:00.037 INFO  201 documents found in the current Lucene index
16:32:00.048 INFO  201 documents in current bulk request
16:32:00.049 INFO  Shard reindexing completed
16:32:00.049 INFO  === Index Id: logs-241998, Shard ID: 3 ===
16:32:00.057 INFO  201 documents found in the current Lucene index
16:32:00.062 INFO  201 documents in current bulk request
16:32:00.063 INFO  Shard reindexing completed
16:32:00.063 INFO  === Index Id: logs-241998, Shard ID: 4 ===
16:32:00.070 INFO  196 documents found in the current Lucene index
16:32:00.076 INFO  196 documents in current bulk request
16:32:00.078 INFO  Shard reindexing completed
16:32:00.079 INFO  === Index Id: logs-191998, Shard ID: 0 ===
16:32:00.082 INFO  206 documents found in the current Lucene index
16:32:00.090 INFO  206 documents in current bulk request
16:32:00.091 INFO  Shard reindexing completed
16:32:00.092 INFO  === Index Id: logs-191998, Shard ID: 1 ===
16:32:00.101 INFO  204 documents found in the current Lucene index
16:32:00.117 INFO  204 documents in current bulk request
16:32:00.118 INFO  Shard reindexing completed
16:32:00.118 INFO  === Index Id: logs-191998, Shard ID: 2 ===
16:32:00.124 INFO  192 documents found in the current Lucene index
16:32:00.129 INFO  192 documents in current bulk request
16:32:00.131 INFO  Shard reindexing completed
16:32:00.131 INFO  === Index Id: logs-191998, Shard ID: 3 ===
16:32:00.135 INFO  196 documents found in the current Lucene index
16:32:00.141 INFO  196 documents in current bulk request
16:32:00.142 INFO  Shard reindexing completed
16:32:00.143 INFO  === Index Id: logs-191998, Shard ID: 4 ===
16:32:00.148 INFO  202 documents found in the current Lucene index
16:32:00.156 INFO  202 documents in current bulk request
16:32:00.157 INFO  Shard reindexing completed
16:32:00.158 INFO  === Index Id: logs-221998, Shard ID: 0 ===
16:32:00.177 INFO  190 documents found in the current Lucene index
16:32:00.184 INFO  190 documents in current bulk request
16:32:00.185 INFO  Shard reindexing completed
16:32:00.185 INFO  === Index Id: logs-221998, Shard ID: 1 ===
16:32:00.218 INFO  207 documents found in the current Lucene index
16:32:00.235 INFO  207 documents in current bulk request
16:32:00.236 INFO  Shard reindexing completed
16:32:00.237 INFO  === Index Id: logs-221998, Shard ID: 2 ===
16:32:00.255 INFO  188 documents found in the current Lucene index
16:32:00.281 INFO  188 documents in current bulk request
16:32:00.283 INFO  Shard reindexing completed
16:32:00.283 INFO  === Index Id: logs-221998, Shard ID: 3 ===
16:32:00.373 INFO  215 documents found in the current Lucene index
16:32:00.483 INFO  215 documents in current bulk request
16:32:00.484 INFO  Shard reindexing completed
16:32:00.484 INFO  === Index Id: logs-221998, Shard ID: 4 ===
16:32:00.535 INFO  200 documents found in the current Lucene index
16:32:00.553 INFO  200 documents in current bulk request
16:32:00.554 INFO  Shard reindexing completed
16:32:00.554 INFO  === Index Id: logs-231998, Shard ID: 0 ===
16:32:00.587 INFO  208 documents found in the current Lucene index
16:32:00.594 INFO  208 documents in current bulk request
16:32:00.595 INFO  Shard reindexing completed
16:32:00.595 INFO  === Index Id: logs-231998, Shard ID: 1 ===
16:32:00.603 INFO  192 documents found in the current Lucene index
16:32:00.610 INFO  192 documents in current bulk request
16:32:00.611 INFO  Shard reindexing completed
16:32:00.611 INFO  === Index Id: logs-231998, Shard ID: 2 ===
16:32:00.614 INFO  190 documents found in the current Lucene index
16:32:00.629 INFO  190 documents in current bulk request
16:32:00.630 INFO  Shard reindexing completed
16:32:00.630 INFO  === Index Id: logs-231998, Shard ID: 3 ===
16:32:00.637 INFO  224 documents found in the current Lucene index
16:32:00.655 INFO  224 documents in current bulk request
16:32:00.656 INFO  Shard reindexing completed
16:32:00.656 INFO  === Index Id: logs-231998, Shard ID: 4 ===
16:32:00.662 INFO  186 documents found in the current Lucene index
16:32:00.665 INFO  186 documents in current bulk request
16:32:00.666 INFO  Shard reindexing completed
16:32:00.666 INFO  === Index Id: reindexed-logs, Shard ID: 0 ===
16:32:00.669 INFO  0 documents found in the current Lucene index
16:32:00.672 INFO  Shard reindexing completed
16:32:00.672 INFO  === Index Id: reindexed-logs, Shard ID: 1 ===
16:32:00.676 INFO  0 documents found in the current Lucene index
16:32:00.676 INFO  Shard reindexing completed
16:32:00.676 INFO  === Index Id: reindexed-logs, Shard ID: 2 ===
16:32:00.679 INFO  0 documents found in the current Lucene index
16:32:00.680 INFO  Shard reindexing completed
16:32:00.680 INFO  === Index Id: reindexed-logs, Shard ID: 3 ===
16:32:00.682 INFO  0 documents found in the current Lucene index
16:32:00.682 INFO  Shard reindexing completed
16:32:00.682 INFO  === Index Id: reindexed-logs, Shard ID: 4 ===
16:32:00.685 INFO  0 documents found in the current Lucene index
16:32:00.685 INFO  Shard reindexing completed
16:32:00.685 INFO  === Index Id: logs-201998, Shard ID: 0 ===
16:32:00.715 INFO  222 documents found in the current Lucene index
16:32:00.723 INFO  222 documents in current bulk request
16:32:00.723 INFO  Shard reindexing completed
16:32:00.724 INFO  === Index Id: logs-201998, Shard ID: 1 ===
16:32:00.760 INFO  193 documents found in the current Lucene index
16:32:00.765 INFO  193 documents in current bulk request
16:32:00.765 INFO  Shard reindexing completed
16:32:00.766 INFO  === Index Id: logs-201998, Shard ID: 2 ===
16:32:00.783 INFO  188 documents found in the current Lucene index
16:32:00.786 INFO  188 documents in current bulk request
16:32:00.787 INFO  Shard reindexing completed
16:32:00.787 INFO  === Index Id: logs-201998, Shard ID: 3 ===
16:32:00.800 INFO  191 documents found in the current Lucene index
16:32:00.811 INFO  191 documents in current bulk request
16:32:00.812 INFO  Shard reindexing completed
16:32:00.812 INFO  === Index Id: logs-201998, Shard ID: 4 ===
16:32:00.831 INFO  206 documents found in the current Lucene index
16:32:00.841 INFO  206 documents in current bulk request
16:32:00.843 INFO  Shard reindexing completed
16:32:00.843 INFO  === Index Id: sonested, Shard ID: 0 ===
16:32:00.854 INFO  2977 documents found in the current Lucene index
16:32:00.856 ERROR Unable to parse Document id from Document.  The Document's Fields: 
16:32:00.877 INFO  Shard reindexing completed
16:32:00.877 INFO  === Index Id: nyc_taxis, Shard ID: 0 ===
16:32:00.882 INFO  1000 documents found in the current Lucene index
16:32:01.179 INFO  1000 documents in current bulk request
16:32:01.181 INFO  Shard reindexing completed
16:32:01.181 INFO  === Index Id: logs-211998, Shard ID: 0 ===
16:32:01.187 INFO  206 documents found in the current Lucene index
16:32:01.190 INFO  206 documents in current bulk request
16:32:01.191 INFO  Shard reindexing completed
16:32:01.191 INFO  === Index Id: logs-211998, Shard ID: 1 ===
16:32:01.199 INFO  189 documents found in the current Lucene index
16:32:01.201 INFO  189 documents in current bulk request
16:32:01.202 INFO  Shard reindexing completed
16:32:01.202 INFO  === Index Id: logs-211998, Shard ID: 2 ===
16:32:01.211 INFO  190 documents found in the current Lucene index
16:32:01.213 INFO  190 documents in current bulk request
16:32:01.214 INFO  Shard reindexing completed
16:32:01.214 INFO  === Index Id: logs-211998, Shard ID: 3 ===
16:32:01.230 INFO  223 documents found in the current Lucene index
16:32:01.233 INFO  223 documents in current bulk request
16:32:01.233 INFO  Shard reindexing completed
16:32:01.234 INFO  === Index Id: logs-211998, Shard ID: 4 ===
16:32:01.271 INFO  192 documents found in the current Lucene index
16:32:01.279 INFO  192 documents in current bulk request
16:32:01.280 INFO  Shard reindexing completed
16:32:01.280 INFO  === Index Id: logs-181998, Shard ID: 0 ===
16:32:01.313 INFO  214 documents found in the current Lucene index
16:32:01.317 INFO  214 documents in current bulk request
16:32:01.318 INFO  Shard reindexing completed
16:32:01.318 INFO  === Index Id: logs-181998, Shard ID: 1 ===
16:32:01.331 INFO  192 documents found in the current Lucene index
16:32:01.338 INFO  192 documents in current bulk request
16:32:01.339 INFO  Shard reindexing completed
16:32:01.340 INFO  === Index Id: logs-181998, Shard ID: 2 ===
16:32:01.364 INFO  183 documents found in the current Lucene index
16:32:01.369 INFO  183 documents in current bulk request
16:32:01.370 INFO  Shard reindexing completed
16:32:01.370 INFO  === Index Id: logs-181998, Shard ID: 3 ===
16:32:01.384 INFO  193 documents found in the current Lucene index
16:32:01.388 INFO  193 documents in current bulk request
16:32:01.388 INFO  Shard reindexing completed
16:32:01.388 INFO  === Index Id: logs-181998, Shard ID: 4 ===
16:32:01.405 INFO  218 documents found in the current Lucene index
16:32:01.410 INFO  218 documents in current bulk request
16:32:01.411 INFO  Shard reindexing completed
16:32:01.411 INFO  === Index Id: geonames, Shard ID: 0 ===
16:32:01.438 INFO  206 documents found in the current Lucene index
16:32:01.453 INFO  206 documents in current bulk request
16:32:01.464 INFO  Shard reindexing completed
16:32:01.464 INFO  === Index Id: geonames, Shard ID: 1 ===
16:32:01.473 INFO  210 documents found in the current Lucene index
16:32:01.479 INFO  210 documents in current bulk request
16:32:01.495 INFO  Shard reindexing completed
16:32:01.495 INFO  === Index Id: geonames, Shard ID: 2 ===
16:32:01.506 INFO  201 documents found in the current Lucene index
16:32:01.515 INFO  201 documents in current bulk request
16:32:01.521 INFO  Shard reindexing completed
16:32:01.521 INFO  === Index Id: geonames, Shard ID: 3 ===
16:32:01.531 INFO  188 documents found in the current Lucene index
16:32:01.535 INFO  188 documents in current bulk request
16:32:01.543 INFO  Shard reindexing completed
16:32:01.543 INFO  === Index Id: geonames, Shard ID: 4 ===
16:32:01.550 INFO  195 documents found in the current Lucene index
16:32:01.555 INFO  195 documents in current bulk request
16:32:01.564 INFO  Shard reindexing completed

Check List

  • New functionality includes testing
    • All tests pass, including unit test, integration test and doctest
  • New functionality has been documented
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

            protocol = null;
        } else {
            // Parse the URL to get the protocol, host name, and port
            String[] urlParts = url.split("://");
Member:

can we use java.net.URI for this:

        if (url == null) {
            hostName = null;
            port = -1;
            protocol = null;
        } else {
            try {
                URI uri = new URI(url);
                hostName = uri.getHost();
                port = uri.getPort();
                protocol = uri.getScheme();
            } catch (URISyntaxException e) {
                throw new IllegalArgumentException("Invalid URL format", e);
            }
        }

Member Author:

Oh, interesting - this makes sense. I was thinking about adding regex checking for the user inputs at the beginning, too, but defense-in-depth is a good approach.
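
For what it's worth, a quick illustration of how java.net.URI behaves for typical inputs (standard-library behavior, not code from this PR). Note that getPort() returns -1 when the URL omits the port, which matches the null branch above:

import java.net.URI;
import java.net.URISyntaxException;

public class UriParsingExample {
    public static void main(String[] args) throws URISyntaxException {
        URI withPort = new URI("http://localhost:9200");
        System.out.println(withPort.getScheme()); // http
        System.out.println(withPort.getHost());   // localhost
        System.out.println(withPort.getPort());   // 9200

        // No explicit port: getPort() returns -1
        System.out.println(new URI("https://example.com").getPort()); // -1
    }
}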


    public static void reindex(String indexName, Flux<Document> documentStream, ConnectionDetails targetConnection) throws Exception {
        String targetUrl = "/" + indexName + "/_bulk";
        HttpClient client = HttpClient.create()
Member:

Can we split the HTTP client into a separate class that can be reused by different operations?

Member Author (@chelma, Apr 24, 2024):

I'd prefer not to at this point, actually. While I would historically agree with you, I'm trying out a new approach on this project and have been really happy with how it has worked out. Specifically - avoiding being too speculative about abstractions and letting the needs of the project shape what gets created. In this case, we only have one thing that needs this reactor-netty client, and I honestly don't know what interface I would provide if I were to carve it out, because I don't know how another potential part of the code might use it. Avoiding speculative abstractions has been one of the key things in this project's history that has enabled me to make so much progress so fast.

Collaborator:

I don't know if you need a separate HttpClient interface yet, but I do think that it might help; in general, you'll want to look to the future and not think about the past.
From my view, you've got some leaky abstractions, with a couple of other needless Flux contaminations within your codebase.
Once those leak in, it will become harder to test your code too (and tests help us write application code faster, too). If you want to write test code fast as well, keep it as generic as you can, with the cleanest interfaces that you can strive for. Simpler pieces -> smoother integrations -> faster delivery of quality solutions.

        // Assemble the request details
        String path = indexName + "/_doc/" + id;
        String body = source;
    private static String convertToBulkJson(List<String> bulkSections) {
Member:

nit: the function name ...BulkJson was a bit confusing since it's just delimited JSONs; maybe `convertToBulkBody` or `convertToDelimitedJsons`

Member Author:

Sure, will do.
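
For context, a minimal sketch of what the renamed helper could look like (the reviewer's suggested name convertToBulkBody is used here; the merged name may differ). The _bulk API expects newline-delimited JSON, and the body must end with a newline:

// Assumes java.util.List; joins the action/source lines of a bulk request.
// The _bulk endpoint requires the body to end with a newline.
private static String convertToBulkBody(List<String> bulkSections) {
    return String.join("\n", bulkSections) + "\n";
}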

            return null; // Skip documents with missing id
        }
        if (source_bytes == null || source_bytes.bytes.length == 0) {
            logger.warn("Document " + id + " is deleted or doesn't have the _source field enabled");
Member:

Would this be better suited for info, if this is expected for deleted documents?

Member Author:

I felt (and I guess still feel) that warn is probably the right level. It's something that we should highlight the occurrence of without it being an error, per se.

Collaborator:

Which one is it - can you tell the difference? Could the reader of the log tell the difference? Is there something in the beginning of the log that would give the user a clue?

If _source wasn't enabled, could this flood the logs?

Collaborator:

Is there any chance that docId could have PII in it? The docId could be customer generated, right? Or are they only internal ids that are separately mapped to the customer-given ones?

If they're customer driven, I'd push this to debug to promote the policy that no PII could be shown for INFO and above logs. This feels like it isn't a great spot to be in. I'm hoping that there's a way to show an identifier without risking divulging a customer value.

Member Author:

> Which one is it - can you tell the difference?

I am not currently aware of how to tell the difference. We have a task to look into this more (see: https://opensearch.atlassian.net/browse/MIGRATIONS-1629)

> Is there any chance that docId could have PII in it?

The docId is an integer value used by Lucene to tell which Lucene Document in the Lucene Index is being referred to. The _id field of the Lucene Document is a user-set alphanumeric string, and so can contain whatever the user wants it to.

Regarding PII - that's a larger discussion for the team to have. I'll book a timeslot to discuss as a reminder.

codecov bot commented Apr 24, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 75.93%. Comparing base (384401e) to head (060fdbe).
Report is 21 commits behind head on main.

❗ Current head 060fdbe differs from the pull request's most recent head bd87ce5. Consider uploading reports for the commit bd87ce5 to get more accurate results.

Additional details and impacted files
@@             Coverage Diff              @@
##               main     #607      +/-   ##
============================================
+ Coverage     75.91%   75.93%   +0.02%     
- Complexity     1491     1496       +5     
============================================
  Files           162      165       +3     
  Lines          6348     6362      +14     
  Branches        572      573       +1     
============================================
+ Hits           4819     4831      +12     
+ Misses         1152     1149       -3     
- Partials        377      382       +5     
Flag        Coverage Δ
unittests   75.93% <ø> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown.


@chelma merged commit 98ee1fd into opensearch-project:main on Apr 24, 2024
5 checks passed
@chelma deleted the MIGRATIONS-1600-2 branch on April 24, 2024 at 16:43
        return Flux.range(0, reader.maxDoc()) // Extract all the Documents in the IndexReader
            .handle((i, sink) -> {
                Document doc = getDocument(reader, i);
                if (doc != null) { // Skip malformed docs
Collaborator:

You should at least log when doc == null (or whatever malformed documents you might be skipping).

Member Author:

We log that in getDocument()

                continue;
            reader.close();
        } catch (IOException e) {
            logger.error("Failed to close IndexReader", e);
Collaborator:

This seems like it's probably a really bad exception. Why should the program keep running?
This seems like a spot where throw Lombok.sneakyThrow(e) would be a better option.

Member Author:

Good question; it probably does make sense to kill the process at this point. I realized just now that Reactor was unhappy that a checked exception was being thrown, but I totally could have thrown an unchecked exception here or something.
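
A sketch of what the reviewer's suggestion might look like (assumes Lombok on the classpath; closeOrDie is a hypothetical name):

import java.io.IOException;
import lombok.Lombok;
import org.apache.lucene.index.IndexReader;

public class ReaderCloseSketch {
    // Rethrows the checked IOException unchecked so the failure propagates
    // (and halts the process) instead of being logged and swallowed.
    static void closeOrDie(IndexReader reader) {
        try {
            reader.close();
        } catch (IOException e) {
            throw Lombok.sneakyThrow(e); // never returns; rethrows e as-is
        }
    }
}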


        StringBuilder errorMessage = new StringBuilder();
        errorMessage.append("Unable to parse Document id from Document. The Document's Fields: ");
        document.getFields().forEach(f -> errorMessage.append(f.name()).append(", "));
        logger.error(errorMessage.toString());
Collaborator:

logger.atError().setCause(e).setMessage(()->...).log() will do two more things for you: 1) get the exception and its backtrace into the logs, and 2) use the fluent style, where everything within '...' is only evaluated when you're logging at that level. It can make your log statements tighter (all one statement rather than four, as they are here) and much more efficient, since work can often be elided. Even if you stay at warn/error, I'd like to routinely filter the repo for usages of immediate logging, because its performance hit can be the single greatest impact on a program.
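
For illustration, the suggestion applied to the snippet above, assuming an SLF4J 2.x logger (the fluent setCause/setMessage calls are SLF4J API; Log4j2's fluent builder spells them differently):

import java.util.stream.Collectors;
import org.apache.lucene.document.Document;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class FluentLoggingSketch {
    private static final Logger logger = LoggerFactory.getLogger(FluentLoggingSketch.class);

    // One statement instead of four; the message supplier only runs
    // when ERROR is actually enabled, and the cause's backtrace is logged.
    static void logUnparseableDocument(Document document, Exception e) {
        logger.atError()
              .setCause(e)
              .setMessage(() -> "Unable to parse Document id from Document. The Document's Fields: "
                      + document.getFields().stream()
                                .map(f -> f.name())
                                .collect(Collectors.joining(", ")))
              .log();
    }
}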

Collaborator:

Consider PII for ERROR. I think that it's fair, but you should call it out... maybe PII-possible loggers should have their own logger-name convention so that operators could easily mask them out if necessary.

Member Author:

> logger.atError().setCause(e).setMessage(()->...).log()

Cool - will look into that for the future.

> Consider PII for ERROR

I think we need to have a larger discussion around stuff like PII concerns, because I suspect they will impact many aspects of the implementation if we're designing to address them up front.


public class LuceneDocumentsReader {
    private static final Logger logger = LogManager.getLogger(LuceneDocumentsReader.class);

-    public static List<Document> readDocuments(Path luceneFilesBasePath, String indexName, int shardId) throws Exception {
+    public Flux<Document> readDocuments(Path luceneFilesBasePath, String indexName, int shardId) {
Collaborator:

Why does your LuceneDocumentsReader now take a hard dependency on your HTTP client library?
It might be better to make this a collection or stream and then adapt later, so that you can switch client implementations out.

Member Author:

Maybe, but it seems like this is how the Reactor framework wants to be used. I can see both the LuceneDocumentsReader and DocumentReindexer classes being implementation specific. So far it's paid off for me in this project not to speculate on stuff like this until there's a specific need.



public class DocumentReindexer {
    private static final Logger logger = LogManager.getLogger(DocumentReindexer.class);
    private static final int MAX_BATCH_SIZE = 1000; // Arbitrarily chosen

    public static void reindex(String indexName, Flux<Document> documentStream, ConnectionDetails targetConnection) throws Exception {
Collaborator:

Question from above - why should this take a Flux in? What would be lost/what would the impact be if you took in a stream and adapted it within this method?
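
For concreteness, a hypothetical signature illustrating the question (not the merged code; ConnectionDetails is the PR's own class): accept a plain java.util.stream.Stream at the boundary and adapt it to a Flux internally, keeping Reactor out of the public interface.

import java.util.stream.Stream;
import org.apache.lucene.document.Document;
import reactor.core.publisher.Flux;

public static void reindex(String indexName, Stream<Document> documentStream, ConnectionDetails targetConnection) {
    Flux<Document> documents = Flux.fromStream(documentStream); // adapt at the edge
    // ... batch into bulk requests and send them, as the current implementation does ...
}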


@@ -34,7 +34,7 @@ public RestClient(ConnectionDetails connectionDetails) {
     }

     public Response get(String path, boolean quietLogging) throws Exception {
-        String urlString = connectionDetails.host + "/" + path;
+        String urlString = connectionDetails.url + "/" + path;

         URL url = new URL(urlString);
         HttpURLConnection conn = (HttpURLConnection) url.openConnection();
Collaborator:

If the plan is to deprecate this class, use the @Deprecated annotation for it (before class RestClient) so that we know the plan is to rally all of the code around one HTTP client solution. As it is, it's pretty confusing with 2 different clients within one codebase/PR.

Member Author:

I'm not sure whether we want to deprecate this class or not in the long run. I would assume so, but honestly the only place we really need to use the greater abilities of the reactor-netty client is for reindexing; this is fine elsewhere. For that reason, I left this in place for the time being.

        doc5.add(new StringField("_id", new BytesRef(encodeUtf8Id("id5")), Field.Store.YES));

        // Set up our mock reader
        IndexReader mockReader = mock(IndexReader.class);
Collaborator:

Let's sync up on Mockito. I wonder if this could have been clearer and tighter without Mockito.
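
One Mockito-free shape this test could take (a sketch assuming Lucene's ByteBuffersDirectory is available; class and values are illustrative): build a small real index in memory and read it back through the same APIs the production code uses.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;

public class InMemoryIndexSketch {
    public static void main(String[] args) throws Exception {
        try (ByteBuffersDirectory dir = new ByteBuffersDirectory()) {
            // Write one real document rather than stubbing IndexReader behavior
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
                Document doc = new Document();
                doc.add(new StringField("_id", "id5", Field.Store.YES));
                writer.addDocument(doc);
            }
            // Read it back with the same reader APIs the production code exercises
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                System.out.println(reader.maxDoc());               // 1
                System.out.println(reader.document(0).get("_id")); // id5
            }
        }
    }
}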

AndreKurait pushed a commit to AndreKurait/opensearch-migrations that referenced this pull request Apr 25, 2024
* Checkpoint: improved ConnectionDetails; unit tested it

Signed-off-by: Chris Helma <chelma+github@amazon.com>

* RFS now uses reactor-netty and bulk indexing

Signed-off-by: Chris Helma <chelma+github@amazon.com>

* Fixes per PR; unit tested LuceneDocumentsReader

Signed-off-by: Chris Helma <chelma+github@amazon.com>

* Updated a unit test name

Signed-off-by: Chris Helma <chelma+github@amazon.com>

* Updated a method name per PR feedback

Signed-off-by: Chris Helma <chelma+github@amazon.com>

---------

Signed-off-by: Chris Helma <chelma+github@amazon.com>