Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

remove baggage propagation from xray propagator #1178

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

jamesmoessis
Copy link

Description:

This PR removes baggage propagation from the X-Ray propagator.

Existing Issue(s):

We've seen strong incompatibilities between the OpenTelemetry/W3C baggage specification, and that which X-Ray allows.

The primary incompatibility is that the limit of 256 characters in the Amazon Trace header. This is incompatible with the W3C baggage limit of 8192 bytes. Attempting to truncate the baggage to fit within this limit risks corrupting the entries and should not be attempted.

We've seen issues with baggage propagation in X-Ray, for example here: aws-observability/aws-otel-java-instrumentation#563

Additionally, other language's xray propagator do not attempt to attach baggage, for example:

Testing:

Removed tests related to baggage.

@wangzlei
Copy link

wangzlei commented Feb 2, 2024

Thanks @jamesmoessis raise the issue.

Other than baggage length 8192 bytes problem, could you help me understand what is other problem? awsxraypropagator is a short term solution before x-ray fully migrate to w3c, we want to understand the customer impact.
aws-observability/aws-otel-java-instrumentation#563 (comment)

@jamesmoessis
Copy link
Author

jamesmoessis commented Feb 5, 2024

@wangzlei

Other than baggage length 8192 bytes problem, could you help me understand what is other problem?

It's mainly the length problem.

Perhaps also another problem is the lack of a proper specification for this header, and disparity between different OpenTelemetry implementations of it as a result.

Particularly the issue is that Java AWS SDK v2 instrumentation hardcodes the use of this propagator, but also allows other propagators to be used in the case of SQS messaging. In the case of SQS, I'd prefer my W3C propagator to be handling the baggage, without risk of corruption from the xraypropagator.

awsxraypropagator is a short term solution before x-ray fully migrate to w3c, we want to understand the customer impact.

The customer impact is that the xray propagator causes 400s to AWS and has the potential to corrupt baggage. I'm excited for when Amazon fully adopt W3C, but until then, the xraypropagator's handling of baggage is harmful, and it's specific to the Java implementation. For the short-term solution, there should be no baggage.

@jamesmoessis
Copy link
Author

@wangzlei any updates or thoughts?

Are we happy causing 400s and corrupting baggage? It's a blocker for using AWS SDK instrumentation since the propagator is hardcoded into it.

@jamesmoessis
Copy link
Author

Not getting much from the code owners here, wonder if maintainers have any thoughts @jack-berg @trask?

@jack-berg
Copy link
Member

@jamesmoessis I added it to the agenda for the 4/11/24 Java SIG. Hopefully we can figure out how to proceed.

@jamesmoessis
Copy link
Author

Thankyou @jack-berg appreciate it. I would come to the SIG but unfortunately it's at 2am for me 😴 I'll keep an eye on the SIG meeting notes too

@jamesmoessis jamesmoessis force-pushed the remove-baggage-from-xray-propagator branch from c23766d to 4dde536 Compare April 10, 2024 01:47
}
// Size is key/value pair, excludes delimiter.
int size = key.length() + entry.getValue().length() + 1;
if (baggageWrittenBytes + size > 256) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The primary incompatibility is that the limit of 256 characters in the Amazon Trace header. This is incompatible with the W3C baggage limit of 8192 bytes. Attempting to truncate the baggage to fit within this limit risks corrupting the entries and should not be attempted.

It looks like the code is only trying to include baggage entires if the entire entry can be included without exceeding 256 bytes.

When you talk about corrupting entries, are you referring to some entries being present while others not? Can you help me understand why this is problematic?

.append(samplingFlag);

Baggage baggage = Baggage.fromContext(context);
// Truncate baggage to 256 chars per X-Ray spec.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't seem to find a spec document for the xray spec format. If there is a document, and it does talk about including baggage like this, we should likely follow the spec.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not aware of a formal specification as well, but I usually follow this reference: https://docs.aws.amazon.com/xray/latest/devguide/aws-xray.html#Tracing-header:~:text=sampling%20rules.-,Tracing%20header,-All%20requests%20are

It seems that there is no equivalent concept to baggage, at least in the reference above.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No reference to baggage here, so maybe its right to delete it.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refer to the comment, using key "XRAY_HEADER_ADDITIONAL_FIELDS"

.append(samplingFlag);

Baggage baggage = Baggage.fromContext(context);
// Truncate baggage to 256 chars per X-Ray spec.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about this comment:

Regarding the length of the X-Ray header, I did not find a definition in X-Ray spec. However, based on some clues, it can be inferred that StepFunction likely has a contract with X-Ray, hence there is length checking in the StepFunction client API. So it is reasonable to truncate the baggage length to (256 - the length of the existing trace context).

My read of this is that including up to 256 baggage bytes is a bug that should be fixed, instead of removing baggage encoding altogether.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. #563 is StepFunction receives a xray header over 256. AWS Lambda is working on a new encoding algorithm to reducing the xray header to be less than 256, then we can fix the bug by truncating xray header to be 256, not truncating baggage to 256.

@trask
Copy link
Member

trask commented Apr 11, 2024

discussed in today's SIG meeting:

  • we'd like to use this PR as a forcing function to ensure we have responsive component owners for the aws xray propagator
  • @rapphil is going to check inside AWS about finding new component owner(s)
  • if it doesn't look like that's going to be fruitful in the next 2 weeks, we'll look for new component owners within the broader OpenTelemetry community

@wangzlei
Copy link

We have internal discussion with @rapphil today. To avoid conflicts with the baggage propagator, xray propagator can update the code to use a dedicated key instead of ImplicitContextKeyed Baggage

@@ -241,9 +199,6 @@ private static <C> Context getContextFromHeader(
if (spanContext.isValid()) {
context = context.with(Span.wrap(spanContext));
}
if (baggage != null) {
context = context.with(baggage.build());

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use a new key "XRAY_HEADER_ADDITIONAL_FIELDS"

context = context.with(XRAY_HEADER_ADDITIONAL_FIELDS ,baggage.build());

FYI #1178 (comment)

@jamesmoessis
Copy link
Author

Sorry folks, haven't had time to get back to this but I would like to as soon as I get the time. If someone else wants to pick this up and fix the problem, feel free. Otherwise I will try to find some time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants