
Add RFS to CDK #575

Merged: 17 commits into opensearch-project:main, Apr 16, 2024
Conversation

@lewijacn (Collaborator) commented Apr 10, 2024

Description

Note: This builds off changes being reviewed here: #566

This change adds RFS as an ECS service that can be enabled in the migration CDK.

Also included are changes to make RFS usable in the E2E test script.

Issues Resolved

https://opensearch.atlassian.net/browse/MIGRATIONS-1604
https://opensearch.atlassian.net/browse/MIGRATIONS-1637

Testing

CDK and E2E script deployment testing

Check List

  • New functionality includes testing
    • All tests pass, including unit test, integration test and doctest
  • New functionality has been documented
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

… fixes

Signed-off-by: Tanner Lewis <lewijacn@amazon.com>
codecov bot commented Apr 11, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 75.77%. Comparing base (fe88c1e) to head (2819114).
Report is 6 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main     #575      +/-   ##
============================================
- Coverage     76.03%   75.77%   -0.27%     
+ Complexity     1503     1490      -13     
============================================
  Files           162      162              
  Lines          6359     6348      -11     
  Branches        567      572       +5     
============================================
- Hits           4835     4810      -25     
- Misses         1151     1161      +10     
- Partials        373      377       +4     
Flag Coverage Δ
unittests 75.77% <ø> (-0.27%) ⬇️


Signed-off-by: Tanner Lewis <lewijacn@amazon.com>
@gregschohn (Collaborator) left a comment:

Given that this is an open-source monorepo for a growing number of projects, I have a strong preference to limit the tracked artifacts to just code/configs. This PR includes test data and that's a very slippery slope that we've so far avoided.

In this case, I think it's easy to work around the issue with Docker layers. Longer term, we should discuss how we'd like to share larger datasets, or those that cannot be easily regenerated from code.

RFS/build.gradle Outdated

for (dockerService in dockerServices) {
task "buildDockerImage_${dockerService.projectName}" (type: DockerBuildImage) {
if (dockerService.projectName == "reindexFromSnapshot") {
Collaborator:

Could you make this a flag in your DockerServiceProps instead so that the creator of the spec drives whether/how this happens rather than implementation details?

@lewijacn (Author):

Sure, made this a bit cleaner.

@@ -0,0 +1,45 @@
## Test Resources
Collaborator:

We've resisted the urge for the past year to check binary or large data files into the (mono)repo. Multi-stage builds seem like the better approach here.
They would be self-documenting, and they would eliminate the need for everybody to pull binary files from git and to update/maintain those binary files (and worry about the large-object churn that goes along with it) as datasets and underlying server versions change. This approach should also be efficient enough for devs working on the images, since the layers would be cached once the byproducts were created, AND it would give them the aforementioned flexibilities.
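
A minimal sketch of the pattern being suggested here (the image tag and paths are illustrative, not the PR's actual Dockerfile): one stage does the expensive dataset generation, and the final stage copies in only the byproduct.

```Dockerfile
FROM docker.elastic.co/elasticsearch/elasticsearch-oss:7.10.2 AS base

FROM base AS makeDataset
# Expensive, one-time work lives in this stage; Docker caches the layer,
# so it only reruns when this stage's instructions change.
RUN mkdir -p /tmp/dataset && echo '{"example":"doc"}' > /tmp/dataset/doc1.json

FROM base
# The final image inherits none of the generation stage's tooling;
# only the byproduct is copied across the stage boundary.
COPY --from=makeDataset /tmp/dataset /usr/share/elasticsearch/dataset
```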

@lewijacn (Author) commented Apr 12, 2024:

Thanks for the sample and discussion. I wasn't sure this would be possible within a Dockerfile stage, but I do prefer the reproducibility of this approach, and it eliminates the previous approach as a pattern that others might follow to store larger datasets. I have updated to use a multi-stage Docker build instead.

@@ -362,6 +362,10 @@ public static void main(String[] args) throws InterruptedException {
}

logger.info("Documents reindexed successfully");

logger.info("Refreshing newly added documents");
Collaborator:

This is at info because refresh could take a significant amount of time? (seems reasonable, just checking)

@lewijacn (Author):

Yes, for larger datasets this can take a noticeable amount of time.

@@ -196,6 +196,18 @@ With the [required setup](#importing-target-clusters) on the target cluster havi
The pipeline configuration file can be viewed (and updated) via AWS Secrets Manager.
Please note that it will be base64 encoded.

## Kicking off Reindex from Snapshot (RFS)

When the RFS service gets deployed, it does not start running immediately. This is by design to put the needed infrastructure in place, and then allow the user to control when the historical data migration should occur.
Collaborator:

"to put the needed infrastructure in place" - I immediately think, who is putting that in place?
The 'why' is confusing here. Does it not auto-start because we want the user to control (& if so, why do there need to be two start steps). If we're waiting for more infrastructure, what specifically? What would the user need to look out for?

@lewijacn (Author):

Yeah, I was struggling with the wording here. Ultimately I wanted to get the point across that the user controls when RFS starts, so I've altered this a bit.

```shell
aws ecs update-service --cluster migration-<STAGE>-ecs-cluster --service migration-<STAGE>-rfs --desired-count 1
```

Currently, the RFS service will enter an idle state upon completion and can be cleaned up by using the same command with `--desired-count 0`
Collaborator:

Does the container enter that state, or does the application running within the container doing RFS? The distinction might be important when a user/dev is debugging a hung shutdown.

@lewijacn (Author):

Added details

```diff
-const allReplayerServiceArn = `arn:aws:ecs:${props.env?.region}:${props.env?.account}:service/migration-${props.stage}-ecs-cluster/migration-${props.stage}-traffic-replayer*`
+const ecsClusterArn = `arn:aws:ecs:${props.env?.region}:${props.env?.account}:service/migration-${props.stage}-ecs-cluster`
+const allReplayerServiceArn = `${ecsClusterArn}/migration-${props.stage}-traffic-replayer*`
+const rfsServiceArn = `${ecsClusterArn}/migration-${props.stage}-rfs`
```
Collaborator:

Can we please spell out 'rfs' so that it has the same format as traffic-replayer?
I'd make the same argument on the top-level directory name of the repo too, to rename that to ReindexFromSnapshot.

@lewijacn (Author):

Yes, I have changed this to be spelled out for clarity, and have also changed some references in CDK where this acronym might be confusing.

…Images

Signed-off-by: Tanner Lewis <lewijacn@amazon.com>
@peternied (Member) commented:

> Given that this is an open-source monorepo for a growing number of projects, I have a strong preference to limit the tracked artifacts to just code/configs. This PR includes test data and that's a very slippery slope that we've so far avoided.
>
> In this case, I think it's easy to work around the issue with Docker layers. Longer term, we should discuss how we'd like to share larger datasets, or those that cannot be easily regenerated from code.

I share these concerns. I'd suggest looking at how benchmarks does this for its test workloads [1]. I think there is a sweet spot in the project cycle to locally optimize, but we might be exiting that phase with ever-increasing test dataset sizes.

@gregschohn (Collaborator) commented:

I haven't looked into how OSB handles workloads yet, but here's a Dockerfile that uses build stages to do a one-time, on-first-use dataset build. It also self-documents how we've constructed the tarball, since anybody can run it. Note that I built on ARM, so I had to install development tools to build a dependency for OSB... all of that will be independent of our final ES image layer; it's only a one-time charge when building the image the first time.

```Dockerfile
FROM docker.elastic.co/elasticsearch/elasticsearch-oss:7.10.2 as base

ENV ELASTIC_SEARCH_CONFIG_FILE=/usr/share/elasticsearch/config/elasticsearch.yml

# without this line, elasticsearch will complain that there aren't enough nodes
RUN echo "discovery.type: single-node" >> $ELASTIC_SEARCH_CONFIG_FILE


FROM base as makeDataSet1

RUN cd /etc/yum.repos.d/ && \
    sed -i 's/mirrorlist/#mirrorlist/g' /etc/yum.repos.d/CentOS-* && \
    sed -i 's|#baseurl=http://mirror.centos.org|baseurl=http://vault.centos.org|g' /etc/yum.repos.d/CentOS-*
RUN yum install -y python3.9
RUN yum install -y gcc python39-devel
RUN yum install -y vim git less

RUN pip3 install opensearch-benchmark

RUN /usr/local/bin/docker-entrypoint.sh eswrapper & echo $! > /tmp/esWrapperProcess.pid && sleep 10 &&\
     opensearch-benchmark execute-test --distribution-version=1.0.0 --target-host=http://localhost:9200 \
    --workload=geonames --pipeline=benchmark-only --test-mode --kill-running-processes  \
    --workload-params "target_throughput:0.5,bulk_size:10,bulk_indexing_clients:1,search_clients:1" &&  \
    tar -cvzf esDataFiles.tgz /usr/share/elasticsearch/data


FROM base

RUN echo y | /usr/share/elasticsearch/bin/elasticsearch-plugin install https://maven.search-guard.com/search-guard-suite-release/com/floragunn/search-guard-suite-plugin/7.10.2-53.5.0/search-guard-suite-plugin-7.10.2-53.5.0.zip
# add search-guard, which provides TLS, as well as other features that we don't need to consider
RUN pushd /usr/share/elasticsearch/plugins/search-guard-7/tools ; chmod ugo+x ./install_demo_configuration.sh ; yes | ./install_demo_configuration.sh ; popd

ENV PROXY_TLS_CONFIG_FILE=/usr/share/elasticsearch/config/proxy_tls.yml
COPY disableTlsConfig.sh enableTlsConfig.sh /root/
RUN chmod ugo+x /root/disableTlsConfig.sh /root/enableTlsConfig.sh
ENV PATH=${PATH}:/usr/share/elasticsearch/jdk/bin/
RUN sed 's/searchguard/plugins.security/g' $ELASTIC_SEARCH_CONFIG_FILE |  \
    grep ssl.http > $PROXY_TLS_CONFIG_FILE

# The following two commands are more convenient for development purposes,
# but maybe not for a demo to show individual steps
RUN /root/enableTlsConfig.sh $ELASTIC_SEARCH_CONFIG_FILE
# Alter this config line to either enable(searchguard.disabled: false) or disable(searchguard.disabled: true) HTTP auth
RUN echo "searchguard.disabled: false" >> $ELASTIC_SEARCH_CONFIG_FILE

COPY --from=makeDataSet1 /usr/share/elasticsearch/esDataFiles.tgz /root/esDataFiles.tgz

CMD /usr/local/bin/docker-entrypoint.sh eswrapper
```

Signed-off-by: Tanner Lewis <lewijacn@amazon.com>
…-stage Dockerfile

Signed-off-by: Tanner Lewis <lewijacn@amazon.com>
Signed-off-by: Tanner Lewis <lewijacn@amazon.com>
@gregschohn (Collaborator) left a comment:

Thanks for adding the generation code. I left a couple of comments. I think you need to tweak the Dockerfile a bit to get the advantages you were trying to achieve.

RUN chmod ug+x /root/generateDataset.sh

RUN /root/generateDataset.sh ${SHOULD_GENERATE_DATA} && \
cd /usr/share/elasticsearch/data && tar -cvzf esDataFiles.tgz nodes
Collaborator:

I'm presuming that you aren't planning on setting up multiple parallel dataset generations (one for each workload, maybe things that aren't OSB, etc.) and pulling lots of different datasets into the final container. If you wanted to support multiple datasets, I'd presume you'd want separate images for each, so that containers would only pay for what they were using.

Presuming that, it will be more efficient to drop the tar command and just leave the directory intact. If you did think you'd be generating a lot of images with different sample data, I think once you start looking at testing at scale, you'll want to pull snapshots into distributed clusters.
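
For instance, dropping the tar step might look like the following (a sketch using the stage name from the hunk above; the data path is the ES default, not confirmed by the PR):

```Dockerfile
# Leave the generated data directory intact and copy it straight across,
# instead of tarring it in the generation stage and unpacking it later.
COPY --from=generateDatasetStage /usr/share/elasticsearch/data /usr/share/elasticsearch/data
```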

@lewijacn (Author):

I have removed the tarring step for now. Part of me would like to build more support here to make generating different datasets in parallel a bit easier, but I'm leaning more toward seeing the usefulness of this current piece and then iterating if we find it's something we'd really like. Testing multi-node clusters or even multiple RFS instances will look pretty different from what we have currently, so I am also cautious that this might shift direction.

RFS/README.md Outdated
```diff
-./gradlew composeUp -Pdataset='small-benchmark-single-node.tar.gz'
+./gradlew composeUp -PshouldGenerateData=true
```
Collaborator:

'generateData' isn't quite right, since the user may never see generation happen (if they're using a cached layer).
From a Docker image perspective, one image has OSB datasets from an OSB run and the other doesn't. I would call this something like dataset=osb_4testWorkloads.

@lewijacn (Author):

I have changed this to be similar to dataset=osb_4testWorkloads.


echo "Running opensearch-benchmark workloads against ${endpoint}"
echo "Running opensearch-benchmark w/ 'geonames' workload..." &&
opensearch-benchmark execute-test --distribution-version=1.0.0 --target-host=$endpoint --workload=geonames --pipeline=benchmark-only --test-mode --kill-running-processes --workload-params "target_throughput:0.5,bulk_size:10,bulk_indexing_clients:1,search_clients:1" --client-options=$client_options &&
Collaborator:

I checked, but didn't find a way to JUST LOAD the data. The opensearch-benchmark runs load the data and then run tests on it. That latter part takes up most of the time and is work that we're not interested in for RFS. We might really, really want to load the data, do an RFS migration, then run the rest of the test, so that we could test CDC on a historically migrated cluster.

It might be a good idea to open an issue or submit a PR with a new option for OSB.

@lewijacn (Author):

We should probably raise an issue, as I don't have a good understanding of how intertwined these two things are from looking at the actual workloads: https://github.com/opensearch-project/opensearch-benchmark-workloads/tree/main/geonames

… tarring datasets

Signed-off-by: Tanner Lewis <lewijacn@amazon.com>
@gregschohn (Collaborator) left a comment:

Please tweak the beginning of your generateDataStage in the Dockerfile definition; it will create a tighter final image that has less baggage and fewer surprises.

Comment on lines 9 to 16
RUN cd /etc/yum.repos.d/ && \
sed -i 's/mirrorlist/#mirrorlist/g' /etc/yum.repos.d/CentOS-* && \
sed -i 's|#baseurl=http://mirror.centos.org|baseurl=http://vault.centos.org|g' /etc/yum.repos.d/CentOS-*
RUN yum install -y gcc python3.9 python39-devel vim git less
RUN pip3 install opensearch-benchmark


FROM base AS generateDatasetStage
Collaborator:

You need to rotate the FROM base AS generateDatasetStage upward. As it is, you'll install all of the yum packages into the final version, rather than ONLY propagating the ES data that you generated. As this stands, there's no reason to do a multi-stage build, because you'll have pretty much the same amount of stuff in your final image - and there's going to be a lot that you don't need!
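
A sketch of the suggested ordering, assembled from the hunk above (the base image tag and the elided generation step are assumptions, not the PR's final Dockerfile):

```Dockerfile
FROM docker.elastic.co/elasticsearch/elasticsearch-oss:7.10.2 AS base

# Rotating this FROM above the package installs confines the tooling
# to the generation stage, so the final stage inherits a clean base.
FROM base AS generateDatasetStage
RUN cd /etc/yum.repos.d/ && \
    sed -i 's/mirrorlist/#mirrorlist/g' /etc/yum.repos.d/CentOS-* && \
    sed -i 's|#baseurl=http://mirror.centos.org|baseurl=http://vault.centos.org|g' /etc/yum.repos.d/CentOS-*
RUN yum install -y gcc python3.9 python39-devel vim git less
RUN pip3 install opensearch-benchmark
# ... generate the dataset here (e.g. the PR's generateDataset.sh) ...

FROM base
# Only the generated data crosses the stage boundary into the final image.
COPY --from=generateDatasetStage /usr/share/elasticsearch/data /usr/share/elasticsearch/data
```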

@lewijacn (Author) commented Apr 16, 2024:

Since this is only a test ES source Docker image that we don't vend, it's been helpful to have these packages in the final version for any ad hoc testing when running RFS locally. As I think about this more, it may make sense to have the migration console be a part of the Docker Compose setup and let it handle things like this in the future.

Collaborator:

Why do you want those things (gcc, python-dev, OSB) on the ES container? Why isn’t the migration console sufficient?

Oh - you don’t have a unified docker env. yet for that distribution, do you? Hmmm - I guess that makes more sense. You should put a comment in the file to explain that.

Maybe I’ll pick up the task to lift the dockerSolution out of the TrafficCapture directory tomorrow

@gregschohn (Collaborator) left a comment:

I'd like to see a comment explaining why the base starts as late as it does.
Please add a line item to https://opensearch.atlassian.net/browse/MIGRATIONS-1628 to update this to 'un-migration-consolify' this image

Signed-off-by: Tanner Lewis <lewijacn@amazon.com>
@lewijacn (Author) commented:

> I'd like to see a comment explaining why the base starts as late as it does. Please add a line item to https://opensearch.atlassian.net/browse/MIGRATIONS-1628 to update this to 'un-migration-consolify' this image

Sure, I have added a TODO to the Dockerfile and updated the Jira task.

# Conflicts:
#	RFS/README.md
#	RFS/build.gradle
#	RFS/docker/TestSource_ES_7_10/Dockerfile
@lewijacn merged commit d72a228 into opensearch-project:main Apr 16, 2024 (7 of 8 checks passed)
gregschohn pushed a commit that referenced this pull request Apr 16, 2024
This change adds RFS as an ECS service that can be enabled in the migration CDK. Also included are changes to make RFS usable in the E2E test script, as well as improvements to data generation for the RFS ES test source Dockerfile, which now uses multi-stage builds instead of static binary files in the repo.

Signed-off-by: Tanner Lewis <lewijacn@amazon.com>
@lewijacn deleted the rfs-cdk branch May 23, 2024 20:50