[Bug]: plugin that build failed but still be assembled #5091

Open
ruanyl opened this issue Oct 10, 2024 · 11 comments
Labels
bug Something isn't working

Comments

@ruanyl
Member

ruanyl commented Oct 10, 2024

Describe the bug

Looking at the build.sh step of this pipeline: https://build.ci.opensearch.org/blue/organizations/jenkins/distribution-build-opensearch/detail/distribution-build-opensearch/10377/pipeline/151
The security-analytics build failed:

2024-10-10 02:27:14 ERROR    ERROR: Command 'bash /tmp/tmps7v24_y5/security-analytics/scripts/build.sh -v 3.0.0 -p linux -a x64 -s false -o builds' returned non-zero exit status 1.
2024-10-10 02:27:14 ERROR    Error building security-analytics, retry with: ./build.sh manifests/3.0.0/opensearch-3.0.0.yml --component security-analytics

However, the plugin was still installed in the assemble.sh step: https://build.ci.opensearch.org/blue/organizations/jenkins/distribution-build-opensearch/detail/distribution-build-opensearch/10377/pipeline/963

2024-10-10 02:35:50 INFO     Installing security-analytics
2024-10-10 02:35:50 INFO     Executing "/tmp/tmpmiej8_16/opensearch-3.0.0/bin/opensearch-plugin install --batch file:/tmp/tmpmiej8_16/opensearch-security-analytics-3.0.0.0.zip" in /tmp/tmpmiej8_16/opensearch-3.0.0
-> Installing file:/tmp/tmpmiej8_16/opensearch-security-analytics-3.0.0.0.zip
-> Downloading file:/tmp/tmpmiej8_16/opensearch-security-analytics-3.0.0.0.zip
-> Installed opensearch-security-analytics with folder name opensearch-security-analytics

Shouldn't the plugin be excluded if it failed to build?

I'm now hitting a runtime issue when running the 2.18.0 and 3.0.0 Docker images, which looks related:

[2024-10-10T03:45:39,749][ERROR][o.o.b.OpenSearchUncaughtExceptionHandler] [opensearch-cluster-master-0] fatal error in thread [main], exiting
java.util.ServiceConfigurationError: org.apache.lucene.codecs.Codec: Provider org.opensearch.securityanalytics.correlation.index.codec.correlation950.CorrelationCodec950 could not be instantiated
	at java.base/java.util.ServiceLoader.fail(ServiceLoader.java:586) ~[?:?]
	at java.base/java.util.ServiceLoader$ProviderImpl.newInstance(ServiceLoader.java:813) ~[?:?]
	at java.base/java.util.ServiceLoader$ProviderImpl.get(ServiceLoader.java:729) ~[?:?]
	at java.base/java.util.ServiceLoader$3.next(ServiceLoader.java:1403) ~[?:?]
	at org.apache.lucene.util.NamedSPILoader.reload(NamedSPILoader.java:68) ~[lucene-core-9.12.0.jar:9.12.0 e913796758de3d9b9440669384b29bec07e6a5cd - 2024-09-25 16:37:02]
	at org.apache.lucene.codecs.Codec.reloadCodecs(Codec.java:136) ~[lucene-core-9.12.0.jar:9.12.0 e913796758de3d9b9440669384b29bec07e6a5cd - 2024-09-25 16:37:02]
	at org.opensearch.plugins.PluginsService.reloadLuceneSPI(PluginsService.java:767) ~[opensearch-3.0.0.jar:3.0.0]
	at org.opensearch.plugins.PluginsService.loadBundle(PluginsService.java:719) ~[opensearch-3.0.0.jar:3.0.0]
	at org.opensearch.plugins.PluginsService.loadBundles(PluginsService.java:545) ~[opensearch-3.0.0.jar:3.0.0]
	at org.opensearch.plugins.PluginsService.<init>(PluginsService.java:197) ~[opensearch-3.0.0.jar:3.0.0]
	at org.opensearch.node.Node.<init>(Node.java:524) ~[opensearch-3.0.0.jar:3.0.0]
	at org.opensearch.node.Node.<init>(Node.java:451) ~[opensearch-3.0.0.jar:3.0.0]
	at org.opensearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:242) ~[opensearch-3.0.0.jar:3.0.0]
	at org.opensearch.bootstrap.Bootstrap.setup(Bootstrap.java:242) ~[opensearch-3.0.0.jar:3.0.0]
	at org.opensearch.bootstrap.Bootstrap.init(Bootstrap.java:404) ~[opensearch-3.0.0.jar:3.0.0]
	at org.opensearch.bootstrap.OpenSearch.init(OpenSearch.java:181) ~[opensearch-3.0.0.jar:3.0.0]
	at org.opensearch.bootstrap.OpenSearch.execute(OpenSearch.java:172) ~[opensearch-3.0.0.jar:3.0.0]
	at org.opensearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:104) ~[opensearch-3.0.0.jar:3.0.0]
	at org.opensearch.cli.Command.mainWithoutErrorHandling(Command.java:138) ~[opensearch-cli-3.0.0.jar:3.0.0]
	at org.opensearch.cli.Command.main(Command.java:101) ~[opensearch-cli-3.0.0.jar:3.0.0]
	at org.opensearch.bootstrap.OpenSearch.main(OpenSearch.java:138) ~[opensearch-3.0.0.jar:3.0.0]
	at org.opensearch.bootstrap.OpenSearch.main(OpenSearch.java:104) ~[opensearch-3.0.0.jar:3.0.0]
Caused by: java.lang.NoClassDefFoundError: org/apache/lucene/codecs/lucene99/Lucene99Codec
	at org.opensearch.securityanalytics.correlation.index.codec.correlation950.CorrelationCodec950.<clinit>(CorrelationCodec950.java:14) ~[?:?]
	at java.base/jdk.internal.misc.Unsafe.ensureClassInitialized0(Native Method) ~[?:?]
	at java.base/jdk.internal.misc.Unsafe.ensureClassInitialized(Unsafe.java:1160) ~[?:?]

To reproduce

Run the 2.18.0 and 3.0.0 OpenSearch Docker images.

Expected behavior

No response

Screenshots

If applicable, add screenshots to help explain your problem.

Host / Environment

No response

Additional context

No response

Relevant log output

No response

@ruanyl ruanyl added bug Something isn't working untriaged Issues that have not yet been triaged labels Oct 10, 2024
@gaiksaya gaiksaya removed the untriaged Issues that have not yet been triaged label Oct 10, 2024
@peterzhuamazon
Member

This is the same issue I described here:

There are multiple things happening here:

  1. In the build workflow, if you include incremental, a previous successful copy of the artifacts is pulled from S3.
  2. Ideally, if any plugin is rebuilt and fails, the run should stop there, preventing the previous successful copy from being used when the current build failed.
  3. In reality, continue-on-error was enabled together with incremental, so if a plugin build fails, the workflow moves on to the next plugin without failing the whole pipeline.
  4. This creates a weird scenario: pluginA failed, but its previous good copy is on disk because of incremental, so the build recorder records it in the build manifest. The build recorder still runs because continue-on-error keeps the pipeline from failing.
  5. The assemble workflow starts by parsing the build manifest, so for a plugin that failed the current build and should have been marked as a failure, it includes the past successful version as if it were the current one.
  6. Another edge case: someone didn't include pluginA in the 2.18.0 input manifest, but because pluginA has a good copy from the previous 2.17.1, incremental still pulls it into the build from S3, and the build recorder and assemble workflow treat it as a success.

Involving @zelinh again to see if there is a better way to solve this.
We should probably remove the zips that are not in the input manifest and the zips that are meant to be rebuilt, to keep the cache from polluting new builds.

Thanks.
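
The cleanup suggested in this comment could look roughly like the sketch below. It is not code from opensearch-build: the function name `prune_stale_zips`, the flat cache directory, and the component-to-zip mapping are all hypothetical and only illustrate the rule "drop cached zips that are not in the input manifest or that are about to be rebuilt".

```python
from pathlib import Path
from typing import Dict, Iterable, List, Set


def prune_stale_zips(
    cache_dir: Path,
    zip_by_component: Dict[str, str],
    manifest_components: Iterable[str],
    components_to_rebuild: Iterable[str],
) -> List[str]:
    """Delete cached zips that must not leak into the current build.

    Hypothetical helper: a cached zip is removed when its component is
    either absent from the input manifest or scheduled to be rebuilt in
    this run. Returns the sorted names of the components whose zips
    were removed.
    """
    in_manifest: Set[str] = set(manifest_components)
    rebuilding: Set[str] = set(components_to_rebuild)
    removed: List[str] = []
    for component, zip_name in zip_by_component.items():
        stale = component not in in_manifest or component in rebuilding
        zip_path = cache_dir / zip_name
        if stale and zip_path.exists():
            zip_path.unlink()  # remove the previously built copy
            removed.append(component)
    return sorted(removed)
```

With the stale zips gone before the build starts, a failed rebuild leaves no zip on disk, so there is nothing for the build recorder to pick up for that component.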

@gaiksaya
Member

I believe it was by design to include the previously built component (using incremental) if the new commit build for that plugin is failing. We can still have a complete bundle using the previous commit, which is very much a trait of a nightly-built artifact.
The failure logging needs to improve so that it is clear what is being installed. If SA failed to build and a previous copy is installed instead, that is expected and should be okay, but the user needs to be informed. Incremental and continue-on-error can go hand in hand: we do not want to fail the entire workflow for a single component, but we also want to install a previous copy if one exists.

Also adding @dblock for suggestions on what the better approach would be.
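
As an illustration of the logging improvement described in this comment, here is a minimal sketch. The logger name, the function names, and the `built_this_run` mapping are assumptions made for the example; none of them come from the actual assemble workflow.

```python
import logging
from typing import Dict, List

# Hypothetical logger name, used only for this sketch.
logger = logging.getLogger("assemble-sketch")


def install_messages(plugins: List[str], built_this_run: Dict[str, bool]) -> List[str]:
    """Return one message per plugin stating where its artifact came from."""
    messages = []
    for name in plugins:
        if built_this_run.get(name, False):
            messages.append(f"Installing {name} (built in this run)")
        else:
            messages.append(
                f"Installing {name} (not built successfully in this run; "
                "reusing a previously built artifact)"
            )
    return messages


def log_installs(plugins: List[str], built_this_run: Dict[str, bool]) -> None:
    """Log fresh artifacts at INFO and reused ones at WARNING."""
    for name, message in zip(plugins, install_messages(plugins, built_this_run)):
        level = logging.INFO if built_this_run.get(name, False) else logging.WARNING
        logger.log(level, message)
```

Logging reused artifacts at WARNING keeps the workflow running, as intended, while making the substitution visible to anyone reading the assemble log.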

@ruanyl
Member Author

ruanyl commented Oct 11, 2024

@peterzhuamazon Thanks!

We should probably remove the zips that are not in the input manifest and the zips that are meant to be rebuilt, to keep the cache from polluting new builds.

  1. Will the build manifest contain the zips of the plugins that failed to build?
  2. Aren't the zips uploaded to S3 "versioned" by the build number? I can see the base URL is https://ci.opensearch.org/ci/dbc/distribution-build-opensearch/3.0.0/10377/linux/x64, so I guess this is where the zips are stored? If so, how does it resolve to an earlier zip when the current build has failed?

@prudhvigodithi
Collaborator

prudhvigodithi commented Oct 11, 2024

Some more discussion on the same topic is in this issue: opensearch-project/opensearch-build-libraries#455.

@ruanyl
Member Author

ruanyl commented Oct 11, 2024

Hi @gaiksaya, thanks!

I believe it was by design to include the previously built component (using incremental) if the new commit build for that plugin is failing.

I kind of get the point of doing this, since we could still have a complete bundle. But I feel that a broken Docker image is worse than one missing certain features. Perhaps in most cases using a previously built component won't result in a runtime error, which is why it's by design? By the way, should we also publish a Docker image tag with the build number, so that people can easily revert when they encounter an issue?

@gaiksaya
Member

gaiksaya commented Oct 14, 2024

[Triage]
Previous discussion of this behavior: opensearch-project/opensearch-build-libraries#455 (comment)
Nightly artifacts are expected to be unstable/broken. That's how we catch issues and raise them with component teams. We are working on adding smoke tests at the distribution level that would detect whether a given artifact is valid. The long-term plan could be to put those artifacts under something like /valid per version.
Adding @zelinh, who is working on the smoke testing framework.

@ruanyl
Member Author

ruanyl commented Oct 15, 2024

Nightly artifacts are expected to be unstable/broken.

Thanks @gaiksaya, that's a fair point. When pushing the Docker image tag, would it make sense to also push a tag with the build number? That would help with reverting to a previous valid version. Or do you have any suggestion on how to revert now?

@gaiksaya
Member

What is the use-case here? Where are the docker images being used?

@ruanyl
Member Author

ruanyl commented Oct 15, 2024

@gaiksaya I'm using the Docker image from https://hub.docker.com/r/opensearchstaging/opensearch. We use 3.0.0 (main) or the current 2.18.0 (2.x) to set up clusters for development/testing/demo environments for OSD features on the main/2.x branches.

@gaiksaya
Member

I would recommend using the validation workflow in this repo to make sure the artifacts you are deploying are valid. We use a similar one in the nightly playgrounds workflow.
However, I recently encountered a bug related to OSD: #5117

@dblock
Member

dblock commented Oct 24, 2024

Related, #5130.

I think as a consumer of any docker staging build I'd like to know:

  1. What are the plugins that built successfully and were included in the build.
  2. What are the plugins in the situation described in this issue, aka didn't build but a previous version was included.
  3. Overall, is this a complete build without failures, meaning a potential beta/demo/release candidate.
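
The three questions above could be answered by a small per-build summary. The sketch below is one possible shape, not an existing format in opensearch-build: `BuildSummary`, `summarize`, and their inputs are hypothetical, and the extra `missing` bucket (a plugin that was expected but is not in the assembled bundle at all) is an assumption added for completeness.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Set


@dataclass
class BuildSummary:
    """Hypothetical per-build report for consumers of a staging image."""

    built: List[str] = field(default_factory=list)    # built now and included
    reused: List[str] = field(default_factory=list)   # build failed, old copy included
    missing: List[str] = field(default_factory=list)  # expected but not included

    @property
    def complete(self) -> bool:
        """True only when every expected plugin was built in this run."""
        return not self.reused and not self.missing


def summarize(
    expected: Iterable[str],
    build_succeeded: Dict[str, bool],
    assembled: Set[str],
) -> BuildSummary:
    """Classify each expected plugin by how it ended up in the bundle."""
    summary = BuildSummary()
    for name in expected:
        if name not in assembled:
            summary.missing.append(name)
        elif build_succeeded.get(name, False):
            summary.built.append(name)
        else:
            summary.reused.append(name)
    return summary
```

For the run in this issue, security-analytics would land in `reused`, and `complete` would be `False`, which tells a consumer the image is not a candidate for a beta, demo, or release.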

Projects
Status: Now(This Quarter)
Development

No branches or pull requests

5 participants