-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replayer instrumentation #475
Replayer instrumentation #475
Conversation
…thod for others to leverage. Signed-off-by: Greg Schohn <greg.schohn@gmail.com>
Most of this is just playing, but making the StreamManager implement AutoCloseable gives a place to end spans to show how long a serializer/connection factory was relevant for. Signed-off-by: Greg Schohn <greg.schohn@gmail.com>
…to the collector to prometheus, zipkin, etc Signed-off-by: Greg Schohn <greg.schohn@gmail.com>
…e config hierarchy. This was broken from the merge https://github.com/opensearch-project/opensearch-migrations/pull/376/files#diff-430f89dc33402ecf692b9a8372f66e585bb2f9215596433216580efc2a56795c. Signed-off-by: Greg Schohn <greg.schohn@gmail.com>
…lotted within the same graph in prometheus. Signed-off-by: Greg Schohn <greg.schohn@gmail.com>
…ics into an optional. Dropping the optionals makes the code simpler and if we don't want to do logging, we can just not fill in the configuration for the SDK. Signed-off-by: Greg Schohn <greg.schohn@gmail.com>
…troduce some more typesafe wrappers for contexts. Lots more to come. Signed-off-by: Greg Schohn <greg.schohn@gmail.com>
…explicitly passing strongly typed context objects. Signed-off-by: Greg Schohn <greg.schohn@gmail.com>
Signed-off-by: Greg Schohn <greg.schohn@gmail.com>
Make sure that the context is using the right requestKey, which also will have the appropriate indices as per the test context. Signed-off-by: Greg Schohn <greg.schohn@gmail.com>
…ayerRequestContext to the replayer Signed-off-by: Greg Schohn <greg.schohn@gmail.com>
Signed-off-by: Greg Schohn <greg.schohn@gmail.com>
Don't bother showing the Kakfa offloader just buffering (was called recordStream). Now the offloader span is a child span of the connection span from the handler, so we can see the handler gathering the request/response (or waiting for the response). Signed-off-by: Greg Schohn <greg.schohn@gmail.com>
That makes it a separate state for the logging handler superclass. Signed-off-by: Greg Schohn <greg.schohn@gmail.com>
…rocessor. Prometheus metrics already have an export_name that is unique, the processors weren't doing anything useful, & the namespace was appending EVERYTHING from one of the two services. Signed-off-by: Greg Schohn <greg.schohn@gmail.com>
…d (less so for now) metrics can be exported across more of the lifetime of a request/connection. Signed-off-by: Greg Schohn <greg.schohn@gmail.com>
… a test bug. Signed-off-by: Greg Schohn <greg.schohn@gmail.com>
Signed-off-by: Greg Schohn <greg.schohn@gmail.com> # Conflicts: # TrafficCapture/nettyWireLogging/src/main/java/org/opensearch/migrations/trafficcapture/netty/ConditionallyReliableLoggingHttpRequestHandler.java # TrafficCapture/nettyWireLogging/src/main/java/org/opensearch/migrations/trafficcapture/netty/LoggingHttpRequestHandler.java # TrafficCapture/nettyWireLogging/src/test/java/org/opensearch/migrations/trafficcapture/netty/ConditionallyReliableLoggingHttpRequestHandlerTest.java # TrafficCapture/trafficCaptureProxyServer/src/main/java/org/opensearch/migrations/trafficcapture/proxyserver/netty/ProxyChannelInitializer.java # TrafficCapture/trafficReplayer/src/main/java/org/opensearch/migrations/replay/Accumulation.java # TrafficCapture/trafficReplayer/src/main/java/org/opensearch/migrations/replay/CapturedTrafficToHttpTransactionAccumulator.java # TrafficCapture/trafficReplayer/src/main/java/org/opensearch/migrations/replay/RequestResponsePacketPair.java # TrafficCapture/trafficReplayer/src/main/java/org/opensearch/migrations/replay/RequestSenderOrchestrator.java # TrafficCapture/trafficReplayer/src/test/java/org/opensearch/migrations/replay/SimpleCapturedTrafficToHttpTransactionAccumulatorTest.java
Signed-off-by: Greg Schohn <greg.schohn@gmail.com>
…when the request was ignored and remove the now-unused file for responses. I'd like to revisit this eventually to make sure that it's as efficient as possible and to organize it better. However, it does get the job done for now and tests were updated to confirm. Signed-off-by: Greg Schohn <greg.schohn@gmail.com>
…uests and responses. Signed-off-by: Greg Schohn <greg.schohn@gmail.com>
…ifferent levels of them. This has the minimal amount of work to get those relationships to simply compile. Nearly every unit test fails and the code is more clunky than it needs to be, but getting to this point was alone a major lift. Signed-off-by: Greg Schohn <greg.schohn@gmail.com>
Signed-off-by: Greg Schohn <greg.schohn@gmail.com>
Signed-off-by: Greg Schohn <greg.schohn@gmail.com>
…in, but the replayer isn't crashing. Signed-off-by: Greg Schohn <greg.schohn@gmail.com>
Signed-off-by: Greg Schohn <greg.schohn@gmail.com>
…scovered by trace inspection. 1) close was called AFTER the RRPair was rotated, so there were no traffic streams being committed, resulting in a perpetual hole in the commit log. 2) close was being scheduled immediately, before all requests (in most cases) because the channelInteractionNumber was being miscalculated. Signed-off-by: Greg Schohn <greg.schohn@gmail.com>
…using the same ConnectionReplaySession to be returned for every connection, which resulted in serious corruption of ordering and tests never completing. Signed-off-by: Greg Schohn <greg.schohn@gmail.com>
Signed-off-by: Greg Schohn <greg.schohn@gmail.com> # Conflicts: # TrafficCapture/captureOffloader/src/test/java/org/opensearch/migrations/trafficcapture/StreamChannelConnectionCaptureSerializerTest.java # TrafficCapture/dockerSolution/src/main/docker/migrationConsole/runTestBenchmarks.sh # deployment/cdk/opensearch-service-migration/lib/service-stacks/migration-analytics-stack.ts
Signed-off-by: Greg Schohn <greg.schohn@gmail.com>
82b71cc
to
9a31210
Compare
return getExponentialBucketsBetween(firstBucketSize, lastBucketCeiling, 2.0); | ||
} | ||
|
||
private static List<Double> getExponentialBucketsBetween(double firstBucketSize, double lastBucketCeiling, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we intend this to work with negative numbers? Say bucket boundaries of [-2, 2]?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
...icCapture/coreUtilities/src/main/java/org/opensearch/migrations/tracing/RootOtelContext.java
Show resolved
Hide resolved
...icCapture/coreUtilities/src/main/java/org/opensearch/migrations/tracing/RootOtelContext.java
Show resolved
Hide resolved
double firstBucketSize, double lastBucketCeiling) { | ||
var buckets = getExponentialBucketsBetween(firstBucketSize, lastBucketCeiling, 2.0); | ||
log.atInfo().setMessage(() -> "Setting buckets for " + activityName + " to " + | ||
buckets.stream().map(x -> "" + x).collect(Collectors.joining(",", "[", "]"))).log(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this actually printing? Are we accurately showing what's a < end vs <=?
…t constructor override, throw an IllegalArgumentException. Signed-off-by: Greg Schohn <greg.schohn@gmail.com>
…losed. Signed-off-by: Greg Schohn <greg.schohn@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please address CI/CD build failure
1a830f3
to
66dcad2
Compare
Bugfix is in the capture proxy's channel context's `sendMeterEventsForEnd()` override to call super so that we'll pickup duration, etc metrics too. The lint fix is in a new python script to facilitate testing from the migration console. The test fix is to give each TrafficReplayer run a fresh TestContext. That context includes channel contexts, which should not be reused across process boundaries and likewise shouldn't be getting reused if we're trying to simulate that for repeated runs. Signed-off-by: Greg Schohn <greg.schohn@gmail.com>
…en though the channel context was. Also make minor fixes to other tests to make them more resilient. Signed-off-by: Greg Schohn <greg.schohn@gmail.com>
66dcad2
to
16a9010
Compare
…mplicitly manage them within a ChannelContext. I've also made a couple more test and production data structures more threadsafe. Signed-off-by: Greg Schohn <greg.schohn@gmail.com>
…ks in test code. The SocketContext is now closed in the callback for the channel close rather than before we call close(). That should make a test failure due to duplicate close() calls much less likely and should also give us a better idea of when the socket was actually closed. There were some OutOfMemoryErrors coming back from the github action after 10 minutes of testing. I believe that this was due to having a number of InMemoryMetricExporters that periodic metric exporters were pumping to (in perpetuity, even after the test was complete). We were also potentially tracking lots of backtraces in the ContextTrackers. Both of these now have close() calls that clears all of that logged data. That's now called by TestContext.close(), which is wired for each InstrumentationTest. The next commit will tie off a lot more loose ends, but this commit was tested more extensively, hence the reason that I'm keeping them separate. Signed-off-by: Greg Schohn <greg.schohn@gmail.com>
Signed-off-by: Greg Schohn <greg.schohn@gmail.com>
…nstants, even if an Instant is what's passed in to the constructor Signed-off-by: Greg Schohn <greg.schohn@gmail.com>
…tributes end up as annotations instead of metadata so that they can be searched. To search in x-ray, use `annotationId.ATTRIBUTE_NAME="..."`. Signed-off-by: Greg Schohn <greg.schohn@gmail.com>
Description
See the README included with this PR.
TLDR - lots of proxy and replayer code now passes contexts throughout the call graph. Those Instumented Contexts are invoked to signal different pieces of progress, which are then turned into metrics and traces that are sent to an otel-collector. The otel collector is similar to what it was before this PR, but the configurations and its deployment, both in docker and for AWS, have been slightly adjusted.
For AWS Cloud deployments, we now configure CloudWatch and X-Ray to received metrics and traces (via OTEL's Collector). For the docker solution, we use Prometheus and Jaeger ("all-in-one") to store metric and trace data. A Grafana container is also started alongside the other containers to allow for discovery and (eventually) dashboards.
Issues Resolved
https://opensearch.atlassian.net/browse/MIGRATIONS-1438
Is this a backport? If so, please add backport PR # and/or commits #
Testing
Unit testing + E2E testing
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.