Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat: Add Mysql2 and Trilogy db.collection.name attribute #1109

Open
wants to merge 15 commits into
base: main
Choose a base branch
from

Conversation

hannahramadan
Copy link
Contributor

@hannahramadan hannahramadan commented Aug 8, 2024

The db.collection.name attribute is conditionally required for Database spans. This PR uses regex to detect the collection name and report it as the db.collection.name attribute for Mysql2 and Trilogy instrumentation.

Closes #1100

@hannahramadan hannahramadan changed the title Add Mysql2 and Trilogy db.collection.name attribute Feat: Add Mysql2 and Trilogy db.collection.name attribute Aug 8, 2024
Copy link
Contributor

@kaylareopelle kaylareopelle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few small wording things on the trilogy test instructions you can keep or leave. Other than that, looks great! Thank you!

hannahramadan and others added 2 commits August 20, 2024 11:58
Co-authored-by: Kayla Reopelle <87386821+kaylareopelle@users.noreply.github.com>
@@ -83,9 +86,17 @@ def _otel_client_attributes

attributes[SemanticConventions::Trace::DB_NAME] = _otel_database_name
attributes[SemanticConventions::Trace::PEER_SERVICE] = config[:peer_service]
attributes['db.collection.name'] = collection_name(sql)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to avoid adding additional overhead to the instrumentation if possible.

In cases where the SQL statement is omitted or not sanitized, there isn't any regexp scan that occurs.

Is there a way to optimize it so that there isn't additional processing?

I would also like to consider that this db.collection.name matches newer versions of the OTel Schema and the instrumentations are still using pre-1.0 semantics.

If I am not mistaken that would have been db.table.name. I think we should continue using pre-1.0 semantics until we have
OTEL_SEMCONV_STABILITY_OPT_IN implemented

https://opentelemetry.io/docs/specs/semconv/database/database-spans/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @arielvalentin! It makes sense to keep consistent with pre-1.0 semantics. In this case, the attribute is db.sql.table. I've made this update!

Re additional processing: I see your point about putting every SQL statement through a regexp scan. Because the table/collection name isn't available on the client, I'm not sure how else we'd be able to get this information. We could make table name an opt-in/opt-out attribute, but is that level of control something we want to provide? What do you think?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean it's difficult to say. I feel like maybe this should only be done when the we also include the db.statement and I don't know if it would be possible to combine with the obfuscation code.

Maybe I'm making too much of it? I'm not sure.

Perhaps benchmarking will put me a bit at ease.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if only including the table name if we're recording the db.statement deviates from the convention (because as of now, the convention has it that recording the collection name isn't based on any condition, just availability).

The change to Regexp.last_match(1) and an updated regex string improved speed. In the following benchmark example, I used the MySQL obfuscation SQL code as a baseline, and found running the extra db.collection.name sql was 1.19x slower.

require 'benchmark/ipsa'

TABLE_NAME = /\b(?:(?:FROM|INTO|UPDATE)|(?:(?:CREATE|DROP|ALTER)\s+TABLE(?:\s+IF\s+(?:NOT\s+)?EXISTS)?))\s+["']?([\w.]+)["']?/i
MYSQL_OBFUSCATION = /(?-mix:'(?:[^']|'')*?(?:\\'.*|'(?!')))|(?-mix:"(?:[^"]|"")*?(?:\\".*|"(?!")))|(?-mix:(\$(?!\d)[^$]*?\$).*?(?:\1|$))|(?-mix:\{?(?:[0-9a-fA-F]\-*){32}\}?)|(?-mix:-?\b(?:[0-9]+\.)?[0-9]+([eE][+-]?[0-9]+)?\b)|(?i-mx:\b(?:true|false|null)\b)|(?-mix:0x[0-9a-fA-F]+)|(?i-mx:(?:#|--).*?(?=\r|\n|$))|(?m-ix:\/\*.*?\*\/)|(?-mix:q'\[.*?(?:\]'|$)|q'\{.*?(?:\}'|$)|q'\<.*?(?:\>'|$)|q'\(.*?(?:\)'|$))/
SQL = 'SELECT * FROM test_table'

def collection_name
  shared_operation
  Regexp.last_match(1) if SQL =~ TABLE_NAME
end

def no_collection_name
  shared_operation
  SQL
end

def shared_operation
  SQL.gsub(MYSQL_OBFUSCATION, '?')
end

Benchmark.ipsa do |x|
  x.report('collection_name') { collection_name }
  x.report('no_collection_name') { no_collection_name }

  x.compare!
end

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@arielvalentin We chatted about this in the SIG today and decided the reporting of table/collection names can be put behind a feature flag. What do you think? If that works, the remaining consideration is if the default on or off - do you have any preference on this?

Copy link
Contributor Author

@hannahramadan hannahramadan Aug 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@arielvalentin - I made a suggestion for the config name db_collection_name and default value include, with omit as the other option: 8328668

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @arielvalentin! Wanted to check in and see if you had thoughts on the above.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your patience. I'm going to add this to my list to review by EoD tomorrow

@arielvalentin
Copy link
Collaborator

@hannahramadan Thank you for your patience waiting for my response.

We discussed some of the details during the SIG on 2024-10-08 which I will share here:

Default to omit

Adding additional regular expressions to extract data adds more overhead than I would like in high volume systems that are performance sensitive. I would like to make it so that users opt-in to this attribute.

Using Mixed Schema Versions

I think I mentioned this already in a separate comment, but I believe that our instrumentations are still emitting pre-1.0 schema attributes. I would have expected the attribute name to be db.sql.table and then once we had the DB Semconv Stability functionality it would emit db.collection.name and/or db.sql.table depending on the user selected option.

I am of the opinion it may be a bit more confusing in this case to have the db.statement attribute along with the db.collection.name, when users I think would expect to see db.query.text instead; so, I think it is best to align on using deprecated attributes until we introduce the use of semconv stability.

Is that not what we want?

Config options are tied to a Schema Version

I think we set a bad precedent with setting individual attribute options because the names are now tied to a specific schema version. I am not certain what to do about that.

Do we continue to have options when we move things along?

Remove Structural Duplication

Identical code is now going to appear in multiple instrumentations, which means that if we need to make changes it will require applying them to multiple locations.

I think that it would be best to extract this functionality into a common library that could be shared across DB instrumentations.

Though the gem is currently named sql-obfuscation I think it is the best place to include this functionality, where instead of it solely being responsible for sanitizing SQL, the gem could provide a helper that extracts or enriches a DB client span. It could use the instrumentation config to make decisions about what attributes to include and what logic to apply.

Something like this:

  SqlProcessor.append_db_attributes(span, sql, config)

Given that this is more of a client side processor than it is a query sanitizer, then I think we should rename the gem to reflect that it is processing SQL and generating attributes for it.

An additional benefit to removing structural duplication is that it would reduce the number of places where we would have to add any DB attribute semconv stability compatibility.

Are Span Processors out of the question?

This is something we did not discuss.

With the introduction of OnEnding, the specification leaves room for us to do some Span Processing outside of the instrumentation code.

This would mean that instrumentations could all emit the SQL and then the SqlProcessor could be run in OnEnding and potentially reduce some of the complexity in the DB instrumentations.

The processor could extract attributes from the SQL and sanitize or omit the statements. This would change configurations so that these options could not be configured on the instrumentations anymore but rather in the span processor.

I think that trade off may be acceptable since it is unlikely we will want to have different configurations for different SQL datastores.

All that being said

We agreed that I would unblock this on the condition that this attribute be omitted by default and move forward with structural duplication and refactor the code in a future PR.

@hannahramadan hannahramadan requested review from a team as code owners October 14, 2024 19:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add db.collection.name to mysql-based instrumentation libraries
4 participants