Add RFS to CDK #575
Changes from 3 commits
First file (a Dockerfile based on elasticsearch-oss:7.10.2):

@@ -1,10 +1,29 @@
-FROM docker.elastic.co/elasticsearch/elasticsearch-oss:7.10.2
+FROM docker.elastic.co/elasticsearch/elasticsearch-oss:7.10.2 AS base
+
+# Configure Elastic
+ENV ELASTIC_SEARCH_CONFIG_FILE=/usr/share/elasticsearch/config/elasticsearch.yml
+# Prevents ES from complaining about nodes count
+RUN echo "discovery.type: single-node" >> $ELASTIC_SEARCH_CONFIG_FILE
+ENV PATH=${PATH}:/usr/share/elasticsearch/jdk/bin/
 
 RUN cd /etc/yum.repos.d/ && \
     sed -i 's/mirrorlist/#mirrorlist/g' /etc/yum.repos.d/CentOS-* && \
-    sed -i 's|#baseurl=http://mirror.centos.org|baseurl=http://vault.centos.org|g' /etc/yum.repos.d/CentOS-* && \
-    yum install -y python3.9 vim git && \
-    pip3 install opensearch-benchmark
+    sed -i 's|#baseurl=http://mirror.centos.org|baseurl=http://vault.centos.org|g' /etc/yum.repos.d/CentOS-*
+RUN yum install -y gcc python3.9 python39-devel vim git less
+RUN pip3 install opensearch-benchmark
+
+
+FROM base AS generateDatasetStage
+
+ARG SHOULD_GENERATE_DATA=false
+COPY generateDataset.sh /root
+RUN chmod ug+x /root/generateDataset.sh
+
+RUN /root/generateDataset.sh ${SHOULD_GENERATE_DATA} && \
+    cd /usr/share/elasticsearch/data && tar -cvzf esDataFiles.tgz nodes
Review comment (on the tar step above): I'm presuming that you aren't planning on setting up multiple parallel dataset generations (one for each workload, maybe things that aren't OSB, etc.) and pulling lots of different datasets into the final container. If you wanted to support multiple datasets, I'd presume you'd want separate images for each, so that containers would only need to pay for what they were using. Presuming that, it will be more efficient to drop the tar command and just leave the directory intact. If you did think you'd be generating a lot of images with different sample data, I think once you start looking at testing at scale you'll want to pull snapshots into distributed clusters.

Reply: I have removed the tarring step for now. Part of me would like to build more support here to make generating different datasets in parallel a bit easier, but I'm leaning toward seeing how useful this current piece is and then iterating if we find it's something we'd really like. Testing multi-node clusters or even multiple RFS instances will look pretty different from what we have currently, so I'm also cautious that the direction might shift.
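If separate images per dataset do become worthwhile later, the multi-stage layout above already allows building the dataset stage on its own. A rough sketch of how that might look from the shell; the image tags here are made up for illustration and assume the Dockerfile is in the current directory:

    # Build only the dataset-generation stage as its own tagged image (hypothetical tag).
    docker build --target generateDatasetStage \
        --build-arg SHOULD_GENERATE_DATA=true \
        -t es-osb-dataset:test-workloads .

    # Build the final stage; with the same build arg, the cached dataset layers are reused.
    docker build --build-arg SHOULD_GENERATE_DATA=true -t es-source:with-osb-data .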
+
+
+FROM base
+
 # Install the S3 Repo Plugin
 RUN echo y | /usr/share/elasticsearch/bin/elasticsearch-plugin install repository-s3
 
@@ -14,21 +33,12 @@ RUN curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2
     unzip awscliv2.zip && \
     ./aws/install
 
-ARG EXISTING_DATA="no-data.tar.gz"
-
 RUN mkdir /snapshots && chown elasticsearch /snapshots
-COPY ./test-resources/${EXISTING_DATA} /usr/share/elasticsearch
-RUN tar -xzf /usr/share/elasticsearch/${EXISTING_DATA} -C /usr/share/elasticsearch/data && \
-    chown -R elasticsearch /usr/share/elasticsearch/data && \
-    rm /usr/share/elasticsearch/${EXISTING_DATA}
 
+COPY --from=generateDatasetStage /usr/share/elasticsearch/data/esDataFiles.tgz /root/esDataFiles.tgz
 # Install our custom entrypoint script
 COPY ./container-start.sh /usr/share/elasticsearch/container-start.sh
 
-# Configure Elastic
-ENV ELASTIC_SEARCH_CONFIG_FILE=/usr/share/elasticsearch/config/elasticsearch.yml
-# Prevents ES from complaining about nodes coun
-RUN echo "discovery.type: single-node" >> $ELASTIC_SEARCH_CONFIG_FILE
-ENV PATH=${PATH}:/usr/share/elasticsearch/jdk/bin/
+RUN tar -xzf /root/esDataFiles.tgz -C /usr/share/elasticsearch/data
 
 CMD /usr/share/elasticsearch/container-start.sh
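For a quick local check of an image built from this Dockerfile, the container can be started and queried directly. A sketch, reusing the hypothetical es-source:with-osb-data tag from above:

    # Start the container; container-start.sh is the CMD baked into the image.
    docker run -d --name es-rfs-source -p 9200:9200 es-source:with-osb-data

    # Once the node is up, any pre-loaded OSB indices should be listed.
    curl -s "http://localhost:9200/_cat/indices?v"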
Second file (new, 27 lines added): the generateDataset.sh script copied into the generateDatasetStage above.

@@ -0,0 +1,27 @@
#!/bin/bash

generate_data_requests() {
    endpoint="http://localhost:9200"
    # If auth or SSL is used, the correlating OSB options should be provided in this array
    options=()
    client_options=$(IFS=,; echo "${options[*]}")
    set -o xtrace

    echo "Running opensearch-benchmark workloads against ${endpoint}"
    echo "Running opensearch-benchmark w/ 'geonames' workload..." &&
    opensearch-benchmark execute-test --distribution-version=1.0.0 --target-host=$endpoint --workload=geonames --pipeline=benchmark-only --test-mode --kill-running-processes --workload-params "target_throughput:0.5,bulk_size:10,bulk_indexing_clients:1,search_clients:1" --client-options=$client_options &&
Review comment (on the opensearch-benchmark invocations): I checked, but didn't find a way to JUST LOAD the data. The opensearch-benchmark runs load the data and then run tests on it; that latter part takes up most of the time and is work we're not interested in for RFS. We might really, really want to load the data, do an RFS migration, then run the rest of the test, so that we could test CDC on a historically migrated cluster. It might be a good idea to open an issue or submit a PR with a new option for OSB.

Reply: We should probably raise an issue, as I don't have a good understanding of how intertwined these two things are just from looking at the actual workloads: https://github.com/opensearch-project/opensearch-benchmark-workloads/tree/main/geonames
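One possible direction for the load-only problem, assuming the installed opensearch-benchmark still exposes Rally-style task filtering (--include-tasks / --exclude-tasks) and that the geonames workload keeps its upstream bulk task name of index-append; both are assumptions to verify before relying on them:

    # Hypothetical ingest-only run: restrict the workload to its bulk-indexing task
    # so the query/aggregation portion is skipped.
    opensearch-benchmark execute-test --target-host=http://localhost:9200 \
        --workload=geonames --pipeline=benchmark-only --test-mode \
        --include-tasks="index-append" --kill-running-processes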
echo "Running opensearch-benchmark w/ 'http_logs' workload..." && | ||
opensearch-benchmark execute-test --distribution-version=1.0.0 --target-host=$endpoint --workload=http_logs --pipeline=benchmark-only --test-mode --kill-running-processes --workload-params "target_throughput:0.5,bulk_size:10,bulk_indexing_clients:1,search_clients:1" --client-options=$client_options && | ||
echo "Running opensearch-benchmark w/ 'nested' workload..." && | ||
opensearch-benchmark execute-test --distribution-version=1.0.0 --target-host=$endpoint --workload=nested --pipeline=benchmark-only --test-mode --kill-running-processes --workload-params "target_throughput:0.5,bulk_size:10,bulk_indexing_clients:1,search_clients:1" --client-options=$client_options && | ||
echo "Running opensearch-benchmark w/ 'nyc_taxis' workload..." && | ||
opensearch-benchmark execute-test --distribution-version=1.0.0 --target-host=$endpoint --workload=nyc_taxis --pipeline=benchmark-only --test-mode --kill-running-processes --workload-params "target_throughput:0.5,bulk_size:10,bulk_indexing_clients:1,search_clients:1" --client-options=$client_options | ||
} | ||
|
||
should_generate_data=$1 | ||
|
||
if [[ "$should_generate_data" == true ]]; then | ||
/usr/local/bin/docker-entrypoint.sh eswrapper & echo $! > /tmp/esWrapperProcess.pid && sleep 10 && generate_data_requests | ||
else | ||
mkdir -p /usr/share/elasticsearch/data/nodes | ||
fi |
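The fixed sleep 10 before generate_data_requests is the most timing-sensitive part of this script. A sketch of a readiness poll that could replace it, assuming the same unauthenticated localhost endpoint used above:

    # Wait until the node answers cluster-health instead of sleeping a fixed 10 seconds.
    wait_for_elasticsearch() {
        for _ in $(seq 1 60); do
            if curl -s "http://localhost:9200/_cluster/health" > /dev/null; then
                return 0
            fi
            sleep 2
        done
        echo "Elasticsearch did not become ready in time" >&2
        return 1
    }

    /usr/local/bin/docker-entrypoint.sh eswrapper & echo $! > /tmp/esWrapperProcess.pid
    wait_for_elasticsearch && generate_data_requests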
A third file was deleted in this change (contents not shown).
Review comment: "generate data" isn't quite right, since the user may never see generation happen (if they're using a cached layer). From a Docker image perspective, one image has OSB datasets from an OSB run and the other doesn't. I would call this something like dataset=osb_4testWorkloads.

Reply: Have changed this to be similar to dataset=osb_4testWorkloads.
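Following the dataset=osb_4testWorkloads naming idea, the distinction between the two image flavors could also be carried in the image tag at build time; the tag names below are hypothetical:

    # One image with the OSB test-workload dataset baked in, one without.
    docker build --build-arg SHOULD_GENERATE_DATA=true  -t es-source:dataset-osb_4testWorkloads .
    docker build --build-arg SHOULD_GENERATE_DATA=false -t es-source:dataset-none .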