612 Introduced RECAP Search Alerts sweep index #4127

Merged
merged 44 commits into from
Oct 19, 2024

@albertisfu albertisfu commented Jun 21, 2024

This PR introduces the Sweep index approach (discussed in #612) to send RECAP Search Alerts that might have been missed by the percolator approach during the day.

The cl_send_recap_alerts command will perform the following tasks:

  1. Remove the RECAPSweepDocument index from the previous day and recreate it to get a clean index (I'll create a follow-up issue to apply the logic to store the last 7-14 days indices for debugging purposes).

  2. Index all documents added/changed during the day from the main RECAP index to the RECAPSweepDocument. The indexing process uses the ES re-index API with a custom query for efficiency. The documents included in the re-index are:

    • Dockets added/modified during the day.
    • All RECAPDocuments from dockets added/modified during the day.
    • Independent RECAPDocuments added/modified during the day.
    • Dockets that are parents of RECAPDocuments added/modified independently during the day.

    This ensures that every document that should be included in the day's alerts is indexed.

  3. Because the re-index process can take considerable time depending on the number of documents, it is scheduled as an ES task. Scheduling the task returns an ID that is used to monitor its progress. Initially, the process waits one minute after the task is scheduled. The estimated waiting time before checking the task status again then changes dynamically with the task's progress, and the check repeats until the task is completed. If getting the task status fails 10 times (possibly due to ES cluster overhead), the process is aborted and an error is logged so we can take action manually.

  4. Some variables are stored in Redis to make the command resumable in case of a failure or if the Pod dies:

    • alert_sweep:re_index_completed: If the command fails after the re-index process is completed, this step can be skipped when the command is resumed.
    • alert_sweep:query_date: This stores the date for which we are sending alerts. The command can be started during the day whose alerts are being sent; if it is running close to 00:00 and a failure occurs after midnight, the resumed command still knows that the alerts belong to the previous day.
    • alert_sweep:task_id: In case of failure during the ES re-index process, monitoring can continue from the previously scheduled task instead of starting a new one.
  5. Send RT and Daily alerts. After the re-index process is completed, Real-time and Daily alerts are sent. The process is the same for both rates:

    • Get alerts for each user.

    • Filter out RT alerts for non-members.

    • Search each query alert against the RECAPSweepDocument and retrieve the hits.

    • Each query is limited to retrieving up to SCHEDULED_ALERT_HITS_LIMIT hits, defaulting to 20 (the current max number of hits in ES OA Search Alerts).

    • Hits are processed to ensure only the correct hits are included and no hits/alerts are duplicated:

      • According to the alert query and the hits returned, the alert can be classified within three groups:
        • Docket-only alert: Queries that don't include any child filter or child field in the text query using advanced search syntax. If the hit contains child documents (matched by a docket field), child documents are filtered, ensuring no child field was highlighted. This helps differentiate whether an alert is Docket-only.
        • RECAP-only query: Alerts that can only match RECAPDocuments, such as if only child filters are used or if the text query only matches a child field like description or plain_text.
        • Cross-object query: Alerts that match both Docket and RECAPDocument fields, such as a combination of a case_name and document_number filter, or a text query matching case_name and plain_text simultaneously.

      The reason to differentiate alert types is to avoid sending alerts incorrectly based on matched content.

      In practice, we only need to differentiate Docket-only alerts from RECAP-only or Cross-object alerts. If a hit in an alert doesn't include RD fields in the query or filters and the hit doesn't match RD highlights (Docket-only), we need to ensure the matched Docket was added/updated during the day. This ensures the alert should be triggered and no child hits are included. To confirm that, the Docket date_modified must belong to the same day, indicating the Docket was added or modified that day. We want to avoid cases where a Docket is indexed due to one of its RECAPDocuments being updated independently.

      For RECAP-only and Cross-object alerts, RECAPDocuments are matched as inner hits. The filtering process confirms that the query contains a child field as a filter or within the text query using advanced syntax, or if a child field is highlighted. If true, the child hit is included in the alert.

  6. An additional filter checks if the Docket hit or the RECAPDocument hit has already triggered the same alert. We keep two sets per alert:

    • alert_hits:id.d stores Dockets that have triggered an alert.
    • alert_hits:id.r stores RECAPDocuments that have triggered the alert.

    For Docket-only alerts, we check if the Docket hit ID is already within alert_hits:id.d. If so, the hit is excluded from the alert. For RECAP-only or Cross-object alerts, we check if the RD hits are within alert_hits:id.r. Only RDs not in the set are included in the alert. If all RD hits have previously triggered the alert, the hit is omitted from the alert.

  7. Finally, after filtering hits and child hits, alert emails are sent, along with their related webhooks.

  8. WLY and MLY rates:

    • The process is similar to that described for RT and DLY rates for filtering hits. However, emails are not sent immediately. Instead, they are stored as ScheduledAlertHit to be sent according to their rate by the cl_send_scheduled_alerts command.
    • WLY and MLY webhooks are sent immediately, similarly to how WLY and MLY ES OA alerts are triggered by the percolator.
  9. The Alerts UI is enabled for RECAP Search behind a waffle flag:

alerts_ui
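
The polling loop from step 3 could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the code in this PR: `wait_for_reindex`, `estimate_wait`, the injected `get_status` and `sleep` callables, and the status keys are all hypothetical names. The rate-based estimate is only one way the waiting time could adapt to progress, and the sketch assumes the 10 failed tries are counted cumulatively.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)

INITIAL_WAIT = 60  # seconds; the description mentions a one-minute initial wait
MAX_FAILED_TRIES = 10  # the description aborts after 10 failed status checks


def estimate_wait(
    created: int, total: int, elapsed: float, minimum: float = 10.0
) -> float:
    """Estimate the seconds left from the indexing rate observed so far."""
    if created <= 0 or elapsed <= 0:
        return float(INITIAL_WAIT)
    rate = created / elapsed  # documents per second so far
    remaining = max(total - created, 0)
    return max(remaining / rate, minimum)


def wait_for_reindex(
    get_status: Callable[[], Optional[dict]],
    sleep: Callable[[float], None],
) -> bool:
    """Poll until the task completes; return False if polling is aborted.

    get_status (hypothetical) returns None when the status request fails,
    otherwise a dict with "completed", "created" and "total" keys.
    """
    failed_tries = 0
    elapsed = 0.0
    wait = float(INITIAL_WAIT)
    while True:
        sleep(wait)
        elapsed += wait
        status = get_status()
        if status is None:
            failed_tries += 1
            if failed_tries >= MAX_FAILED_TRIES:
                logger.error(
                    "Re-index task status unavailable after %s tries; aborting.",
                    failed_tries,
                )
                return False
            continue  # retry with the same waiting time
        if status["completed"]:
            return True
        wait = estimate_wait(status["created"], status["total"], elapsed)
```

In a real command, `get_status` would presumably wrap a call to the ES Tasks API and `sleep` would be `time.sleep`; injecting both keeps the loop easy to test.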

Here are some examples of alert emails:

  • Docket-only alert: Only the Docket is included in the alert with no child hits.
    rt-docket-only-query

  • RECAPDocument-only alert: The docket fields are shown with the RECAPDocument nested below the docket fields.
    dly_recap_only_query

  • Cross-object alert: Here we can see how multiple cases are included in the alert. In this case, the cross-object alert matched a hit by its case_name and also matched an RD belonging to the case that included the keywords within the document description. The second hit only matched the case by its case_name with no RDs matched.

Screenshot 2024-07-03 at 6 52 36 p m

Also note that View Additional Results for this case is shown in the first case.

This is because the original search matched more RECAPDocuments due to the case_name being indexed into each RD, which is the behavior in the frontend.
Screenshot 2024-07-03 at 6 53 13 p m

In the alert, we keep only the RDs that actually matched the alert.
One question here: should we keep the View Additional Results button as in the frontend, or only show it if there are still 5 RDs matched after filtering? It's important to note that when clicking that button in the frontend, more results can be shown since it includes RDs filtered from the alert.

Notes and additional questions:

  • Highlighted fields are the same as in the frontend.
  • Let me know if any other Docket or RECAPDocument fields should be included in the alerts.
  • What would be the change in the subject or content of the RT alerts missed by the percolator that are sent by the sweep index?
  • Should we trigger the alert if it’s a cross-object alert that matches both a Docket and RDs that have been updated during the same day but all the RDs matched have already triggered that alert? Should we send the alert only including the Docket, or omit it? Considering it is a cross-object alert, it is expected to see both the Docket and RDs in the alert.
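
For reference, the hit filtering described in steps 5 and 6 could be sketched as below. This is a simplified illustration, not the PR's implementation: `filter_hits` and the hit keys (`docket_id`, `date_modified`, `child_docs`) are hypothetical, plain Python sets stand in for the `alert_hits:id.d` and `alert_hits:id.r` sets, and a hit with no surviving child documents is treated as Docket-only (the real classification also inspects the query and highlights).

```python
from datetime import date, datetime


def filter_hits(hits: list, query_date: date, docket_seen: set, rd_seen: set) -> list:
    """Return the hits that should still trigger the alert.

    hits: dicts with hypothetical keys "docket_id", "date_modified"
    (a datetime) and "child_docs" (IDs of RECAPDocuments that matched).
    docket_seen / rd_seen: stand-ins for alert_hits:id.d and alert_hits:id.r.
    """
    kept = []
    for hit in hits:
        if not hit["child_docs"]:
            # Docket-only hit: the Docket itself must have been added or
            # modified on the query date and must not have triggered before.
            if hit["date_modified"].date() != query_date:
                continue
            if hit["docket_id"] in docket_seen:
                continue
            docket_seen.add(hit["docket_id"])
            kept.append(hit)
        else:
            # RECAP-only or cross-object hit: keep only RDs that have not
            # triggered this alert yet; omit the hit if none remain.
            new_rds = [rd for rd in hit["child_docs"] if rd not in rd_seen]
            if not new_rds:
                continue
            rd_seen.update(new_rds)
            kept.append({**hit, "child_docs": new_rds})
    return kept
```

With this shape, a cross-object hit whose RDs have all triggered the alert before is omitted entirely, which matches the behavior described in step 6; sending the Docket alone would be a change to the `if not new_rds` branch.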

{% endif %}
{% if doc.plain_text %}
{% contains_highlights doc.plain_text.0 True as highlighted %}
<span style="display: block; margin-top: 5px;">{% if highlighted %}&hellip; {% endif %}{{ doc.plain_text|render_string_or_list|safe|underscore_to_space }} &hellip;</span>
Detected a segment of a Flask template where autoescaping is explicitly disabled with '| safe' filter. This allows rendering of raw HTML in this segment. Ensure no user data is rendered here, otherwise this is a cross-site scripting (XSS) vulnerability.

Ignore this finding from template-unescaped-with-safe.

<a href="https://www.courtlistener.com{% if doc.absolute_url %}{{ doc.absolute_url }}{% else %}{{ result.docket_absolute_url }}#minute-entry-{{ doc.docket_entry_id }}{% endif %}" class="visitable">{% if doc.short_description %}{{ doc.short_description|render_string_or_list|safe }}<span class="gray">&nbsp;&mdash;&nbsp;</span>{% endif %}Document #{% if doc.document_number %}{{ doc.document_number }}{% endif %}{% if doc.attachment_number %}, Attachment #{{ doc.attachment_number }}{% endif %}
</a>
{% if doc.description %}
<span style="display: block; margin-top: 5px;">Description: {{ doc.description|render_string_or_list|safe }}</span>
Detected a segment of a Flask template where autoescaping is explicitly disabled with '| safe' filter. This allows rendering of raw HTML in this segment. Ensure no user data is rendered here, otherwise this is a cross-site scripting (XSS) vulnerability.

Ignore this finding from template-unescaped-with-safe.

{% for doc in result.child_docs %}
{% with doc=doc|get_attrdict:"_source" %}
<li>
<a href="https://www.courtlistener.com{% if doc.absolute_url %}{{ doc.absolute_url }}{% else %}{{ result.docket_absolute_url }}#minute-entry-{{ doc.docket_entry_id }}{% endif %}" class="visitable">{% if doc.short_description %}{{ doc.short_description|render_string_or_list|safe }}<span class="gray">&nbsp;&mdash;&nbsp;</span>{% endif %}Document #{% if doc.document_number %}{{ doc.document_number }}{% endif %}{% if doc.attachment_number %}, Attachment #{{ doc.attachment_number }}{% endif %}
Detected a segment of a Flask template where autoescaping is explicitly disabled with '| safe' filter. This allows rendering of raw HTML in this segment. Ensure no user data is rendered here, otherwise this is a cross-site scripting (XSS) vulnerability.

Ignore this finding from template-unescaped-with-safe.

{{ forloop.counter }}. {{ result|get_highlight:"caseName"|safe }}
({% if result.court_id != 'scotus' %}{{ result|get_highlight:"court_citation_string"|nbsp|safe }}&nbsp;{% endif %}{% if type == 'o' %}{{ result.dateFiled|date:"Y" }}{% elif type == 'oa' %}{{ result.dateArgued|date:"Y" }}{% endif %})
({% if result.court_id != 'scotus' %}{{ result|get_highlight:"court_citation_string"|nbsp|safe }}&nbsp;{% endif %}{% if type == 'o' %}{{ result.dateFiled|date:"Y" }}{% elif type == 'oa' %}{{ result.dateArgued|date:"Y" }}{% elif type == 'r' %}{{ result.dateFiled|date:"Y" }}{% endif %})
Detected a segment of a Flask template where autoescaping is explicitly disabled with '| safe' filter. This allows rendering of raw HTML in this segment. Ensure no user data is rendered here, otherwise this is a cross-site scripting (XSS) vulnerability.

Ignore this finding from template-unescaped-with-safe.

semgrep-app bot commented Jun 28, 2024

Semgrep found 1 template-unescaped-with-safe finding:

  • cl/alerts/templates/alert_email_es.html

Detected a segment of a Flask template where autoescaping is explicitly disabled with '| safe' filter. This allows rendering of raw HTML in this segment. Ensure no user data is rendered here, otherwise this is a cross-site scripting (XSS) vulnerability.

Ignore this finding from template-unescaped-with-safe.

semgrep-app bot commented Jul 2, 2024

Semgrep found 6 baseclass-attribute-override findings:

Class RECAPSweepDocument inherits from both DocketDocument and ESRECAPDocument which both have a method named prepare_trustee_str; one of these methods will be overwritten.

Ignore this finding from baseclass-attribute-override.
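
What this finding means in Python terms can be illustrated with a small sketch. The classes below are simplified stand-ins that reuse the names from the finding, not the real document classes, and the order of the base classes is an assumption. With multiple inheritance, the method resolution order (MRO) picks the method from the first listed base that defines it.

```python
# Simplified stand-ins, NOT the real Elasticsearch document classes.
class DocketDocument:
    def prepare_trustee_str(self, instance):
        return "from DocketDocument"


class ESRECAPDocument:
    def prepare_trustee_str(self, instance):
        return "from ESRECAPDocument"


# Assumed base order; swapping the bases would swap which method wins.
class RECAPSweepDocument(DocketDocument, ESRECAPDocument):
    pass


doc = RECAPSweepDocument()
# The MRO is RECAPSweepDocument, DocketDocument, ESRECAPDocument, object,
# so the DocketDocument version shadows the ESRECAPDocument one.
winner = doc.prepare_trustee_str(None)
```

If both implementations are equivalent the shadowing is harmless; otherwise the subclass can define the method explicitly to make the choice visible.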

@albertisfu albertisfu force-pushed the 612-introduced-recap-search-alerts branch from fc5720e to d102664 Compare July 29, 2024 16:09
Comment on lines +3108 to +3110
child_total_query = child_docs_count_query.extra(
size=0, track_total_hits=True
)
'QuerySet.extra' does not provide safeguards against SQL injection and requires very careful use. SQL injection can lead to critical data being stolen by attackers. Instead of using '.extra', use the Django ORM and parameterized queries such as People.objects.get(name='Bob').

Ignore this finding from avoid-query-set-extra.

Comment on lines +3094 to +3096
main_doc_count_query = main_doc_count_query.extra(
size=0, track_total_hits=True
)
'QuerySet.extra' does not provide safeguards against SQL injection and requires very careful use. SQL injection can lead to critical data being stolen by attackers. Instead of using '.extra', use the Django ORM and parameterized queries such as People.objects.get(name='Bob').

Ignore this finding from avoid-query-set-extra.

@albertisfu

Just confirming that the RECAP Search Alerts UI will be controlled by recap-alerts-active waffle flag.

cl/alerts/utils.py (outdated, resolved)
cl/search/constants.py (outdated, resolved)
@ERosendo ERosendo left a comment

@albertisfu Here are some suggestions for refactoring that I believe would improve the readability of this PR

cl/alerts/management/commands/cl_send_recap_alerts.py (outdated, resolved)
cl/alerts/management/commands/cl_send_recap_alerts.py (outdated, resolved)
cl/alerts/management/commands/cl_send_recap_alerts.py (outdated, resolved)
cl/alerts/management/commands/cl_send_recap_alerts.py (outdated, resolved)
cl/alerts/management/commands/cl_send_recap_alerts.py (outdated, resolved)
cl/alerts/management/commands/cl_send_recap_alerts.py (outdated, resolved)
…rogress

This commit introduces a new dataclass to store and manage information related to running Elasticsearch tasks. The dataclass includes properties for task completion status, created and total document counts.
Adds a new dataclass to encapsulate common Elasticsearch queries used throughout the codebase. This centralizes query definitions, making it easier to maintain and update them.

Updates the `build_es_base_query` method to return an instance of `EsMainQueries` instead of a tuple. This ensures consistent query structure and simplifies future modifications.
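
As an illustration of the task dataclass described in these commit messages, here is a minimal sketch. `TaskStatus`, its `progress` property and `task_status_from_response` are hypothetical names rather than the ones used in the PR, and the response shape (a top-level `completed` flag plus `task.status.created` and `task.status.total`) is an assumption about what the ES Tasks API returns for a re-index task.

```python
from dataclasses import dataclass


@dataclass
class TaskStatus:
    """Hypothetical container for the state of a running ES task."""

    completed: bool = False
    created: int = 0
    total: int = 0

    @property
    def progress(self) -> float:
        """Fraction of documents created so far, between 0.0 and 1.0."""
        if self.total <= 0:
            return 0.0
        return min(self.created / self.total, 1.0)


def task_status_from_response(response: dict) -> TaskStatus:
    """Build a TaskStatus from an (assumed) Tasks API response dict."""
    status = response.get("task", {}).get("status", {})
    return TaskStatus(
        completed=response.get("completed", False),
        created=status.get("created", 0),
        total=status.get("total", 0),
    )
```

Keeping these values in one object, rather than passing loose counters around, is the kind of readability gain the commit message describes.
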
@albertisfu

Thanks, @ERosendo, for the changes applied. They look good!

@mlissner, this is ready for merging. I've already created the waffle recap-alerts-active in production to ensure that the UI for creating RECAP alerts is disabled for the public.

The remaining task once this is merged is to set up a cron job for the cl_send_recap_alerts command, which is responsible for indexing documents into the sweep index, sending RECAP RT and Daily alerts, and scheduling the RECAP Weekly and Monthly Alerts.

A good time to schedule the cron job would be at midnight, but we also need to consider the scheduling of the cl_send_scheduled_alerts cron for Weekly and Monthly alerts. Ideally, cl_send_recap_alerts should finish before cl_send_scheduled_alerts to ensure that alerts for the end of the week and month are included in the Weekly and Monthly notifications.

@mlissner mlissner merged commit 6267f3c into main Oct 19, 2024
13 checks passed
@mlissner mlissner deleted the 612-introduced-recap-search-alerts branch October 19, 2024 00:17
@mlissner

IN IT GOES! THANK YOU BOTH!

For the sweep cronjob, we can just not do that until you get back, right? It'll just mean that certain alerts don't send, which I think is fine for the moment? If that's the case, let's just make a new issue for that, and we can put that on your list for your return.

@mlissner

Eh, actually, I take that back. I'll try to get this done next week!
