Add RFS to CDK #575
Conversation
… fixes Signed-off-by: Tanner Lewis <lewijacn@amazon.com>
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

```
@@             Coverage Diff              @@
##               main     #575      +/-   ##
============================================
- Coverage     76.03%   75.77%    -0.27%
+ Complexity     1503     1490       -13
============================================
  Files           162      162
  Lines          6359     6348       -11
  Branches        567      572        +5
============================================
- Hits           4835     4810       -25
- Misses         1151     1161       +10
- Partials        373      377        +4
```
Given that this is an open-source monorepo for a growing number of projects, I have a strong preference to limit the tracked artifacts to just code/configs. This PR includes test data, and that's a very slippery slope that we've so far avoided.
In this case, I think that it's easy to work around the issue with Docker layers. Longer term, we should discuss how we'd like to share larger datasets, or those that cannot be easily regenerated from code.
RFS/build.gradle
Outdated
```groovy
for (dockerService in dockerServices) {
    task "buildDockerImage_${dockerService.projectName}" (type: DockerBuildImage) {
        if (dockerService.projectName == "reindexFromSnapshot") {
```
Could you make this a flag in your DockerServiceProps instead so that the creator of the spec drives whether/how this happens rather than implementation details?
Sure, made this a bit cleaner.
```markdown
@@ -0,0 +1,45 @@
## Test Resources
```
We've resisted the urge for the past year to check binary or large data files into the (mono)repo. Multi-stage builds seem like the better approach here.
They would be self-documenting, and they would eliminate the need for everybody to pull binary files from git and to update/maintain those binary files (and worry about the large-object churn that would go along with it) as datasets and underlying server versions change. This approach should also be efficient enough for devs working on the images, since the layers would be cached once the byproducts were created the first time, AND it gives them the aforementioned flexibilities.
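The multi-stage idea described above can be sketched roughly as follows. This is a minimal illustration, not the Dockerfile from this PR: the stage name, the base image tag, and the use of `generateDataset.sh` (a script that appears elsewhere in this PR's diff) are assumptions.

```dockerfile
# Hypothetical sketch of the multi-stage dataset approach; stage names, image
# tags, and paths are illustrative and not taken verbatim from this PR.

# "Builder" stage: carries the generation tooling and produces the dataset once.
FROM docker.elastic.co/elasticsearch/elasticsearch-oss:7.10.2 AS datasetBuilder
COPY generateDataset.sh /root/generateDataset.sh
# Run the generation; the resulting index files land in the ES data directory
RUN /root/generateDataset.sh

# Final stage: a clean server image that copies ONLY the generated data, so
# none of the generation tooling or intermediate artifacts ship in the image.
FROM docker.elastic.co/elasticsearch/elasticsearch-oss:7.10.2
COPY --from=datasetBuilder /usr/share/elasticsearch/data /usr/share/elasticsearch/data
```

Because the final stage copies only the data directory, rebuilds reuse the cached builder layer, and the repository never needs to track the generated files.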
Thanks for the sample and discussion. I wasn't sure this would be possible within a stage of a Dockerfile, but I do prefer the reproducibility of this approach, and it eliminates the previous approach as a pattern that others might follow to store larger datasets. I have updated to use a multi-stage Docker approach instead.
```java
@@ -362,6 +362,10 @@ public static void main(String[] args) throws InterruptedException {
    }

    logger.info("Documents reindexed successfully");

    logger.info("Refreshing newly added documents");
```
This is at info because refresh could take a significant amount of time? (seems reasonable, just checking)
Yes, for larger datasets this can take a noticeable amount of time.
```markdown
@@ -196,6 +196,18 @@ With the [required setup](#importing-target-clusters) on the target cluster havi
The pipeline configuration file can be viewed (and updated) via AWS Secrets Manager.
Please note that it will be base64 encoded.

## Kicking off Reindex from Snapshot (RFS)

When the RFS service gets deployed, it does not start running immediately. This is by design to put the needed infrastructure in place, and then allow the user to control when the historical data migration should occur.
```
"to put the needed infrastructure in place" - I immediately think: who is putting that in place?
The 'why' is confusing here. Does it not auto-start because we want the user to control it (and if so, why do there need to be two start steps)? If we're waiting for more infrastructure, what specifically? What would the user need to look out for?
Yeah, I was struggling with the wording here. Ultimately I wanted to get the point across that the user is in control of the start of RFS, so I've altered this a bit.
```shell
aws ecs update-service --cluster migration-<STAGE>-ecs-cluster --service migration-<STAGE>-rfs --desired-count 1
```

Currently, the RFS service will enter an idle state upon completion and can be cleaned up by using the same command with `--desired-count 0`.
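For reference, the full start/check/stop sequence might look like the following. The `update-service` calls come from the documentation above; the `describe-services` status check is my own addition (a standard AWS CLI call, but not part of this PR), and `<STAGE>` must be replaced with the actual deployment stage.

```shell
# Kick off the historical migration by scaling the RFS service up to one task
aws ecs update-service --cluster migration-<STAGE>-ecs-cluster \
    --service migration-<STAGE>-rfs --desired-count 1

# Optional sanity check: confirm the task actually started
# (desiredCount and runningCount should converge on 1)
aws ecs describe-services --cluster migration-<STAGE>-ecs-cluster \
    --services migration-<STAGE>-rfs \
    --query 'services[0].[desiredCount,runningCount]'

# Once the migration completes and the service is idle, wind it back down
aws ecs update-service --cluster migration-<STAGE>-ecs-cluster \
    --service migration-<STAGE>-rfs --desired-count 0
```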
Does the container enter that state or the application running within the container doing RFS enter that? The distinction might be important for when a user/dev is debugging a hung shutdown.
Added details
```diff
- const allReplayerServiceArn = `arn:aws:ecs:${props.env?.region}:${props.env?.account}:service/migration-${props.stage}-ecs-cluster/migration-${props.stage}-traffic-replayer*`
+ const ecsClusterArn = `arn:aws:ecs:${props.env?.region}:${props.env?.account}:service/migration-${props.stage}-ecs-cluster`
+ const allReplayerServiceArn = `${ecsClusterArn}/migration-${props.stage}-traffic-replayer*`
+ const rfsServiceArn = `${ecsClusterArn}/migration-${props.stage}-rfs`
```
can we please spell out 'rfs' so that it has the same format as traffic-replayer?
I'd make the same argument on the top-level directory name of the repo too, to rename that to ReindexFromSnapshot.
Yes, I have changed this to be spelled out for clarity, and have also changed some references in CDK where this acronym might be confusing.
…Images Signed-off-by: Tanner Lewis <lewijacn@amazon.com>
I share these concerns. Over time, I'd suggest looking at how benchmarks does this for its test workloads [1]. I think there is a sweet spot in the project cycle to locally optimize, but we might be exiting that phase with ever-increasing test dataset sizes.
I haven't looked into how OSB handles workloads yet, but here's a Dockerfile that uses build stages to generate the dataset once, when it's first used. It also self-documents how we've constructed the tarball, since anybody can run it. Note that I built on ARM, so I had to install development tools to build a dependency for OSB... all of that will be independent of our final ES image layer; it's only a one-time charge when building the image the first time.
…-stage Dockerfile Signed-off-by: Tanner Lewis <lewijacn@amazon.com>
Thanks for adding the generation code. I left a couple of comments. I think that you need to tweak the Dockerfile a bit to get the advantages that you were trying to achieve.
```dockerfile
RUN chmod ug+x /root/generateDataset.sh

RUN /root/generateDataset.sh ${SHOULD_GENERATE_DATA} && \
    cd /usr/share/elasticsearch/data && tar -cvzf esDataFiles.tgz nodes
```
I'm presuming that you aren't planning on setting up multiple parallel dataset generations (one for each workload, maybe things that aren't OSB, etc.) and pulling lots of different datasets into the final container. If you wanted to support multiple datasets, I'd presume that you'd want separate images for each, so that containers would only need to pay for what they were using.
Presuming that, it will be more efficient to drop the tar command and just leave the directory intact. (If you did think that you'd be generating a lot of images with different sample data, that would be a different story.) I think once you start looking at testing at scale, you'll want to pull snapshots into distributed clusters.
I have removed the tarring step for now. Part of me would like to build more support here to make generating different datasets in parallel a bit easier, but I'm leaning toward seeing the usefulness of this current piece and then iterating if we find it's something we'd really like. Testing multi-node clusters, or even multiple RFS instances, will look pretty different from what we have currently, so I am also cautious that we might shift direction.
RFS/README.md
Outdated
```diff
- ./gradlew composeUp -Pdataset='small-benchmark-single-node.tar.gz'
+ ./gradlew composeUp -PshouldGenerateData=true
```
"generate data" isn't quite right, since the user may never see generation happen (if they're using a cached layer).
From a docker image perspective, one image has OSB datasets from an OSB run and the other doesn't. I would call this something like `dataset=osb_4testWorkloads`.
I have changed this to be similar to `dataset=osb_4testWorkloads`.
```shell
echo "Running opensearch-benchmark workloads against ${endpoint}"
echo "Running opensearch-benchmark w/ 'geonames' workload..." &&
opensearch-benchmark execute-test --distribution-version=1.0.0 --target-host=$endpoint --workload=geonames --pipeline=benchmark-only --test-mode --kill-running-processes --workload-params "target_throughput:0.5,bulk_size:10,bulk_indexing_clients:1,search_clients:1" --client-options=$client_options &&
```
I checked, but didn't find a way to JUST LOAD the data. The opensearch-benchmark runs load the data and then run tests on it. That latter part takes up most of the time and is work that we're not interested in for RFS. We might really, really want to load the data, do an RFS migration, then run the rest of the test, so that we could test CDC on a historically migrated cluster.
It might be a good idea to open an issue or submit a PR with a new option for OSB.
We should probably raise an issue as I don't have a good understanding of how intertwined these two things are from looking at the actual workloads: https://github.com/opensearch-project/opensearch-benchmark-workloads/tree/main/geonames
… tarring datasets Signed-off-by: Tanner Lewis <lewijacn@amazon.com>
Please tweak the beginning of your generateDatasetStage in the Dockerfile definition; it will create a tighter final image that has less baggage and fewer surprises.
```dockerfile
RUN cd /etc/yum.repos.d/ && \
    sed -i 's/mirrorlist/#mirrorlist/g' /etc/yum.repos.d/CentOS-* && \
    sed -i 's|#baseurl=http://mirror.centos.org|baseurl=http://vault.centos.org|g' /etc/yum.repos.d/CentOS-*
RUN yum install -y gcc python3.9 python39-devel vim git less
RUN pip3 install opensearch-benchmark

FROM base AS generateDatasetStage
```
You need to rotate the `FROM base AS generateDatasetStage` upward. As it is, you'll install all of the yum packages into the final version, rather than ONLY propagating the ES data that you generated. As this is now, there's no reason to do a multi-stage build, because you'll have pretty much the same amount of stuff in your final image, and there's going to be a lot that you don't need!
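To make the suggested reordering concrete, a rough sketch is below. This is illustrative only: the base image tag is an assumption (the package list is taken from the snippet above), and the data path mirrors the standard Elasticsearch layout rather than anything confirmed by this PR.

```dockerfile
# Illustrative reordering: the heavy tooling is installed in the dataset
# generation stage only, and the final stage descends from the lean base.

FROM docker.elastic.co/elasticsearch/elasticsearch-oss:7.10.2 AS base
# ...common setup shared by all stages goes here...

# Generation stage: yum/pip tooling is installed HERE, so these layers never
# reach the final image.
FROM base AS generateDatasetStage
RUN yum install -y gcc python3.9 python39-devel vim git less
RUN pip3 install opensearch-benchmark
# ...run the dataset generation in this stage...

# Final stage: back to the lean base image; copy in ONLY the generated data.
FROM base
COPY --from=generateDatasetStage /usr/share/elasticsearch/data /usr/share/elasticsearch/data
```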
Since this is only a test ES source Docker image that we don't vend, it's been helpful to have these packages in the final version for any ad hoc testing when running RFS locally. As I'm thinking about this more, it may make sense to have the migration console be a part of the docker compose setup and let it handle things like this in the future.
Why do you want those things (gcc, python-dev, OSB) on the ES container? Why isn’t the migration console sufficient?
Oh - you don’t have a unified docker env. yet for that distribution, do you? Hmmm - I guess that makes more sense. You should put a comment in the file to explain that.
Maybe I’ll pick up the task to lift the dockerSolution out of the TrafficCapture directory tomorrow
I'd like to see a comment explaining why the base starts as late as it does.
Please add a line item to https://opensearch.atlassian.net/browse/MIGRATIONS-1628 to update this to 'un-migration-consolify' this image
Sure, I have added a TODO on the Dockerfile and updated the Jira task.
```
# Conflicts:
#   RFS/README.md
#   RFS/build.gradle
#   RFS/docker/TestSource_ES_7_10/Dockerfile
```
This change adds RFS as an ECS service that can be enabled in the migration CDK. Also included are changes to make RFS usable in the E2E test script, as well as improvements to the data generation for the RFS ES test source Dockerfile, which now uses multi-stage builds instead of static binary files in the repo. Signed-off-by: Tanner Lewis <lewijacn@amazon.com>
Description
Note: This builds off changes being reviewed here: #566
This change adds RFS as an ECS service that can be enabled in the migration CDK.
Also included are changes to make RFS usable in the E2E test script.
Issues Resolved
https://opensearch.atlassian.net/browse/MIGRATIONS-1604
https://opensearch.atlassian.net/browse/MIGRATIONS-1637
Testing
CDK and E2E script deployment testing
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.