
Add RFS to CDK #575

Merged: 17 commits into opensearch-project:main, Apr 16, 2024
Conversation

@lewijacn (Collaborator) commented Apr 10, 2024

Description

Note: This builds off changes being reviewed here: #566

This change adds RFS as an ECS service that can be enabled in the migration CDK.

Also included are changes to make RFS usable in the E2E test script.

Issues Resolved

https://opensearch.atlassian.net/browse/MIGRATIONS-1604
https://opensearch.atlassian.net/browse/MIGRATIONS-1637

Testing

CDK and E2E script deployment testing

Check List

  • New functionality includes testing
    • All tests pass, including unit test, integration test and doctest
  • New functionality has been documented
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

… fixes

Signed-off-by: Tanner Lewis <lewijacn@amazon.com>
codecov bot commented Apr 11, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 75.77%. Comparing base (fe88c1e) to head (2819114).
Report is 6 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main     #575      +/-   ##
============================================
- Coverage     76.03%   75.77%   -0.27%     
+ Complexity     1503     1490      -13     
============================================
  Files           162      162              
  Lines          6359     6348      -11     
  Branches        567      572       +5     
============================================
- Hits           4835     4810      -25     
- Misses         1151     1161      +10     
- Partials        373      377       +4     
Flag Coverage Δ
unittests 75.77% <ø> (-0.27%) ⬇️


Signed-off-by: Tanner Lewis <lewijacn@amazon.com>
@gregschohn (Collaborator) left a comment:

Given that this is an open-source monorepo for a growing number of projects, I have a strong preference to limit the tracked artifacts to just code/configs. This PR includes test data and that's a very slippery slope that we've so far avoided.

In this case, I think it's easy to work around the issue with Docker layers. Longer term, we should discuss how we'd like to share larger datasets, or those that cannot be easily regenerated from code.

RFS/build.gradle Outdated

for (dockerService in dockerServices) {
task "buildDockerImage_${dockerService.projectName}" (type: DockerBuildImage) {
if (dockerService.projectName == "reindexFromSnapshot") {
Collaborator:

Could you make this a flag in your DockerServiceProps instead so that the creator of the spec drives whether/how this happens rather than implementation details?

@lewijacn (Author):

Sure, made this a bit cleaner.

@@ -0,0 +1,45 @@
## Test Resources
Collaborator:

We've resisted the urge for the past year to check binary or large data files into the (mono)repo. Multi-stage builds seem like the better approach here.
They would be self-documenting, and they would eliminate the need for everybody to pull binary files from git and to update/maintain those binary files (and worry about the large-object churn that goes along with it) as datasets and underlying server versions change. This approach should also be efficient enough for devs working on the images, since the layers would be cached once the byproducts were created, AND it would give them the aforementioned flexibilities.
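
A minimal sketch of the pattern being suggested here (the image tag and paths are illustrative, not the PR's actual Dockerfile): one stage does the expensive dataset generation, and the final stage copies in only the byproduct.

```Dockerfile
FROM docker.elastic.co/elasticsearch/elasticsearch-oss:7.10.2 AS base

FROM base AS makeDataset
# Expensive, one-time work lives in this stage; Docker caches the layer,
# so it only reruns when this stage's instructions change.
RUN mkdir -p /tmp/dataset && echo '{"example":"doc"}' > /tmp/dataset/doc1.json

FROM base
# The final image inherits none of the generation stage's tooling;
# only the byproduct is copied across the stage boundary.
COPY --from=makeDataset /tmp/dataset /usr/share/elasticsearch/dataset
```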

@lewijacn (Author) commented Apr 12, 2024:

Thanks for the sample and discussion. I wasn't sure this would be possible within a Dockerfile stage, but I do prefer the reproducibility of this approach, and it eliminates the previous approach as a pattern that others might follow to store larger datasets. I have updated to use a multi-stage Docker build instead.

@@ -362,6 +362,10 @@ public static void main(String[] args) throws InterruptedException {
}

logger.info("Documents reindexed successfully");

logger.info("Refreshing newly added documents");
Collaborator:

This is at info because refresh could take a significant amount of time? (seems reasonable, just checking)

@lewijacn (Author):

Yes, for larger datasets this can take a noticeable amount of time.

@@ -196,6 +196,18 @@ With the [required setup](#importing-target-clusters) on the target cluster havi
The pipeline configuration file can be viewed (and updated) via AWS Secrets Manager.
Please note that it will be base64 encoded.

## Kicking off Reindex from Snapshot (RFS)

When the RFS service gets deployed, it does not start running immediately. This is by design to put the needed infrastructure in place, and then allow the user to control when the historical data migration should occur.
Collaborator:

"to put the needed infrastructure in place" - I immediately think, who is putting that in place?
The 'why' is confusing here. Does it not auto-start because we want the user to control (& if so, why do there need to be two start steps). If we're waiting for more infrastructure, what specifically? What would the user need to look out for?

@lewijacn (Author):

Yeah, I was struggling with the wording here. Ultimately I wanted to get the point across that the user controls when RFS starts, so I've altered this a bit.

```shell
aws ecs update-service --cluster migration-<STAGE>-ecs-cluster --service migration-<STAGE>-rfs --desired-count 1
```

Currently, the RFS service will enter an idle state upon completion and can be cleaned up by using the same command with `--desired-count 0`
Collaborator:

Does the container enter that state, or does the application running within the container doing RFS? The distinction might be important when a user/dev is debugging a hung shutdown.

@lewijacn (Author):

Added details

```diff
-const allReplayerServiceArn = `arn:aws:ecs:${props.env?.region}:${props.env?.account}:service/migration-${props.stage}-ecs-cluster/migration-${props.stage}-traffic-replayer*`
+const ecsClusterArn = `arn:aws:ecs:${props.env?.region}:${props.env?.account}:service/migration-${props.stage}-ecs-cluster`
+const allReplayerServiceArn = `${ecsClusterArn}/migration-${props.stage}-traffic-replayer*`
+const rfsServiceArn = `${ecsClusterArn}/migration-${props.stage}-rfs`
```
Collaborator:

Can we please spell out 'rfs' so that it has the same format as traffic-replayer?
I'd make the same argument on the top-level directory name of the repo too, to rename that to ReindexFromSnapshot.

@lewijacn (Author):

Yes, I have changed this to be spelled out for clarity, and have also changed some references in CDK where this acronym might be confusing.

…Images

Signed-off-by: Tanner Lewis <lewijacn@amazon.com>
@peternied (Member) commented:

> Given that this is an open-source monorepo for a growing number of projects, I have a strong preference to limit the tracked artifacts to just code/configs. This PR includes test data and that's a very slippery slope that we've so far avoided.
>
> In this case, I think it's easy to work around the issue with Docker layers. Longer term, we should discuss how we'd like to share larger datasets, or those that cannot be easily regenerated from code.

I share these concerns. I'd suggest looking at how benchmarks does this for its test workloads [1]. I think there is a sweet spot in the project cycle to locally optimize, but we might be exiting that phase with ever-increasing test dataset sizes.

@gregschohn (Collaborator) commented:

I haven't looked into how OSB handles workloads yet, but here's a Dockerfile that uses build stages to do a one-time, on-first-use dataset build. It also self-documents how we've constructed the tarball, since anybody can run it. Note that I built on ARM, so I had to install development tools to build a dependency for OSB... all of that will be independent of our final ES image layer; it's only a one-time charge when building the image the first time.

```Dockerfile
FROM docker.elastic.co/elasticsearch/elasticsearch-oss:7.10.2 as base

ENV ELASTIC_SEARCH_CONFIG_FILE=/usr/share/elasticsearch/config/elasticsearch.yml

# without this line, elasticsearch will complain that there aren't enough nodes
RUN echo "discovery.type: single-node" >> $ELASTIC_SEARCH_CONFIG_FILE


FROM base as makeDataSet1

RUN cd /etc/yum.repos.d/ && \
    sed -i 's/mirrorlist/#mirrorlist/g' /etc/yum.repos.d/CentOS-* && \
    sed -i 's|#baseurl=http://mirror.centos.org|baseurl=http://vault.centos.org|g' /etc/yum.repos.d/CentOS-*
RUN yum install -y python3.9
RUN yum install -y gcc python39-devel
RUN yum install -y vim git less

RUN pip3 install opensearch-benchmark

RUN /usr/local/bin/docker-entrypoint.sh eswrapper & echo $! > /tmp/esWrapperProcess.pid && sleep 10 &&\
     opensearch-benchmark execute-test --distribution-version=1.0.0 --target-host=http://localhost:9200 \
    --workload=geonames --pipeline=benchmark-only --test-mode --kill-running-processes  \
    --workload-params "target_throughput:0.5,bulk_size:10,bulk_indexing_clients:1,search_clients:1" &&  \
    tar -cvzf esDataFiles.tgz /usr/share/elasticsearch/data


FROM base

RUN echo y | /usr/share/elasticsearch/bin/elasticsearch-plugin install https://maven.search-guard.com/search-guard-suite-release/com/floragunn/search-guard-suite-plugin/7.10.2-53.5.0/search-guard-suite-plugin-7.10.2-53.5.0.zip
# add search-guard, which provides TLS, as well as other features that we don't need to consider
RUN pushd /usr/share/elasticsearch/plugins/search-guard-7/tools ; chmod ugo+x ./install_demo_configuration.sh ; yes | ./install_demo_configuration.sh ; popd

ENV PROXY_TLS_CONFIG_FILE=/usr/share/elasticsearch/config/proxy_tls.yml
COPY disableTlsConfig.sh enableTlsConfig.sh /root/
RUN chmod ugo+x /root/disableTlsConfig.sh /root/enableTlsConfig.sh
ENV PATH=${PATH}:/usr/share/elasticsearch/jdk/bin/
RUN sed 's/searchguard/plugins.security/g' $ELASTIC_SEARCH_CONFIG_FILE |  \
    grep ssl.http > $PROXY_TLS_CONFIG_FILE

# The following two commands are more convenient for development purposes,
# but maybe not for a demo to show individual steps
RUN /root/enableTlsConfig.sh $ELASTIC_SEARCH_CONFIG_FILE
# Alter this config line to either enable(searchguard.disabled: false) or disable(searchguard.disabled: true) HTTP auth
RUN echo "searchguard.disabled: false" >> $ELASTIC_SEARCH_CONFIG_FILE

COPY --from=makeDataSet1 /usr/share/elasticsearch/esDataFiles.tgz /root/esDataFiles.tgz

CMD /usr/local/bin/docker-entrypoint.sh eswrapper
```

Signed-off-by: Tanner Lewis <lewijacn@amazon.com>
…-stage Dockerfile

Signed-off-by: Tanner Lewis <lewijacn@amazon.com>
Signed-off-by: Tanner Lewis <lewijacn@amazon.com>
@gregschohn (Collaborator) left a comment:

Thanks for adding the generation code. I left a couple of comments. I think you need to tweak the Dockerfile a bit to get the advantages you were trying to achieve.

RUN chmod ug+x /root/generateDataset.sh

RUN /root/generateDataset.sh ${SHOULD_GENERATE_DATA} && \
cd /usr/share/elasticsearch/data && tar -cvzf esDataFiles.tgz nodes
Collaborator:

I'm presuming that you aren't planning on setting up multiple parallel dataset generations (one for each workload, maybe things that aren't OSB, etc.) and pulling lots of different datasets into the final container. If you wanted to support multiple datasets, I'd presume you'd want separate images for each, so that containers would only pay for what they were using.

Presuming that, it will be more efficient to drop the tar command and just leave the directory intact. If you did think you'd be generating a lot of images with different sample data, I think once you start looking at testing at scale, you'll want to pull snapshots into distributed clusters.
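
For instance, dropping the tar step might look like the following (a sketch using the stage name from the hunk above; the data path is the ES default, not confirmed by the PR):

```Dockerfile
# Leave the generated data directory intact and copy it straight across,
# instead of tarring it in the generation stage and unpacking it later.
COPY --from=generateDatasetStage /usr/share/elasticsearch/data /usr/share/elasticsearch/data
```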

@lewijacn (Author):

I have removed the tarring step for now. Part of me would like to build more support here to make generating different datasets in parallel a bit easier, but I'm leaning more toward seeing the usefulness of this current piece and then iterating if we find it's something we'd really like. Testing multi-node clusters or even multiple RFS instances will look pretty different from what we have currently, so I am also cautious that this might shift direction.

RFS/README.md Outdated
```diff
-./gradlew composeUp -Pdataset='small-benchmark-single-node.tar.gz'
+./gradlew composeUp -PshouldGenerateData=true
```
Collaborator:

'generateData' isn't quite right, since the user may never see generation happen (if they're using a cached layer).
From a Docker image perspective, one image has OSB datasets from an OSB run and the other doesn't. I would call this something like dataset=osb_4testWorkloads.

@lewijacn (Author):

I have changed this to be similar to dataset=osb_4testWorkloads.


echo "Running opensearch-benchmark workloads against ${endpoint}"
echo "Running opensearch-benchmark w/ 'geonames' workload..." &&
opensearch-benchmark execute-test --distribution-version=1.0.0 --target-host=$endpoint --workload=geonames --pipeline=benchmark-only --test-mode --kill-running-processes --workload-params "target_throughput:0.5,bulk_size:10,bulk_indexing_clients:1,search_clients:1" --client-options=$client_options &&
Collaborator:

I checked, but didn't find a way to JUST LOAD the data. The opensearch-benchmark runs load the data and then run tests on it. That latter part takes up most of the time and is work that we're not interested in for RFS. We might really, really want to load the data, do an RFS migration, then run the rest of the test, so that we could test CDC on a historically migrated cluster.

It might be a good idea to open an issue or submit a PR with a new option for OSB.

@lewijacn (Author):

We should probably raise an issue, as I don't have a good understanding of how intertwined these two things are from looking at the actual workloads: https://github.com/opensearch-project/opensearch-benchmark-workloads/tree/main/geonames

… tarring datasets

Signed-off-by: Tanner Lewis <lewijacn@amazon.com>
@gregschohn (Collaborator) left a comment:

Please tweak the beginning of your generateDataStage in the Dockerfile definition; it will create a tighter final image that has less baggage and fewer surprises.

Comment on lines 9 to 16
RUN cd /etc/yum.repos.d/ && \
sed -i 's/mirrorlist/#mirrorlist/g' /etc/yum.repos.d/CentOS-* && \
sed -i 's|#baseurl=http://mirror.centos.org|baseurl=http://vault.centos.org|g' /etc/yum.repos.d/CentOS-*
RUN yum install -y gcc python3.9 python39-devel vim git less
RUN pip3 install opensearch-benchmark


FROM base AS generateDatasetStage
Collaborator:

You need to rotate the FROM base AS generateDatasetStage upward. As it is, you'll install all of the yum packages into the final version, rather than ONLY propagating the ES data that you generated. As this stands, there's no reason to do a multi-stage build, because you'll have pretty much the same amount of stuff in your final image - and there's going to be a lot that you don't need!
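
A sketch of the suggested ordering, assembled from the hunk above (the base image tag and the elided generation step are assumptions, not the PR's final Dockerfile):

```Dockerfile
FROM docker.elastic.co/elasticsearch/elasticsearch-oss:7.10.2 AS base

# Rotating this FROM above the package installs confines the tooling
# to the generation stage, so the final stage inherits a clean base.
FROM base AS generateDatasetStage
RUN cd /etc/yum.repos.d/ && \
    sed -i 's/mirrorlist/#mirrorlist/g' /etc/yum.repos.d/CentOS-* && \
    sed -i 's|#baseurl=http://mirror.centos.org|baseurl=http://vault.centos.org|g' /etc/yum.repos.d/CentOS-*
RUN yum install -y gcc python3.9 python39-devel vim git less
RUN pip3 install opensearch-benchmark
# ... generate the dataset here (e.g. the PR's generateDataset.sh) ...

FROM base
# Only the generated data crosses the stage boundary into the final image.
COPY --from=generateDatasetStage /usr/share/elasticsearch/data /usr/share/elasticsearch/data
```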

@lewijacn (Author) commented Apr 16, 2024:

Since this is only a test ES source Docker image that we don't vend, it's been helpful to have these packages in the final version for any ad hoc testing when running RFS locally. As I think about this more, it may make sense to have the migration console be a part of the Docker Compose setup and let it handle things like this in the future.

Collaborator:

Why do you want those things (gcc, python-dev, OSB) on the ES container? Why isn’t the migration console sufficient?

Oh - you don’t have a unified docker env. yet for that distribution, do you? Hmmm - I guess that makes more sense. You should put a comment in the file to explain that.

Maybe I’ll pick up the task to lift the dockerSolution out of the TrafficCapture directory tomorrow

@gregschohn (Collaborator) left a comment:

I'd like to see a comment explaining why the base starts as late as it does.
Please add a line item to https://opensearch.atlassian.net/browse/MIGRATIONS-1628 to update this to 'un-migration-consolify' this image

Signed-off-by: Tanner Lewis <lewijacn@amazon.com>
@lewijacn (Author) commented:

> I'd like to see a comment explaining why the base starts as late as it does. Please add a line item to https://opensearch.atlassian.net/browse/MIGRATIONS-1628 to update this to 'un-migration-consolify' this image

Sure, I have added a TODO to the Dockerfile and updated the Jira task.

# Conflicts:
#	RFS/README.md
#	RFS/build.gradle
#	RFS/docker/TestSource_ES_7_10/Dockerfile
@lewijacn merged commit d72a228 into opensearch-project:main Apr 16, 2024 (7 of 8 checks passed)
gregschohn pushed a commit that referenced this pull request Apr 16, 2024
This change adds RFS as an ECS service that can be enabled in the migration CDK. Also included are changes to make RFS usable in the E2E test script, as well as improvements to data generation for the RFS ES test source Dockerfile, which now uses multi-stage builds instead of static binary files in the repo.

Signed-off-by: Tanner Lewis <lewijacn@amazon.com>
@lewijacn deleted the rfs-cdk branch May 23, 2024 20:50