Stuck Jobs Reporter and Fixes and S3 Buckets updates #19535

Merged
merged 13 commits into from
Sep 27, 2023

@AdamShawBAH AdamShawBAH commented Sep 20, 2023

Resolves Deploy and Validate in production for Claimdatedt
Resolves Deploy and Validate in production - DTA SC Creation Failed
Resolves Validate in production for claim not established
Resolves Deploy and Validate in production for BGS::ShareError
Resolves Validate in production - Can't create a SC DTA for Appeal
Resolves Migrate S3 updates to Prod and smoke test

Description

Introduces five stuck job remediations and a service to log the changes they make. Descriptions of the stuck jobs are listed below.

Introduces a job to handle the ClaimDateDt error:

  • The job will search for all decision documents stamped with the ClaimDateDt error, then verify that both uploaded_to_vbms_at and processed_at are populated. The error will be cleared only if both columns have values.
  • During the intake process, a Decision Document is sent to be established in VBMS. If the date entered in Caseflow is in the future, VBMS rejects it, because VBMS cannot accept future dates. The SyncReviewsJob will catch these errors and retry them. However, the error column is never cleared after a successful retry, so the Decision Document remains stamped with this error message.
  • Resolves Deploy and Validate in production for Claimdatedt
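
The check described above can be sketched in plain Ruby. This is a minimal illustration, not the job's actual code: `DecisionDocSketch` is a hypothetical stand-in for the ActiveRecord model, and matching on the substring "ClaimDateDt" is an assumption about how the error text is identified.

```ruby
# Hypothetical stand-in for the decision document model (assumption, not Caseflow code).
DecisionDocSketch = Struct.new(:id, :error, :uploaded_to_vbms_at, :processed_at, keyword_init: true)

# Clears the error only when both timestamps are populated.
# Returns the documents whose error was cleared.
def clear_claim_date_dt_errors(docs)
  docs.select do |doc|
    # Assumption: the stuck records are identified by this substring.
    next false unless doc.error.to_s.include?("ClaimDateDt")
    next false if doc.uploaded_to_vbms_at.nil? || doc.processed_at.nil?

    doc.error = nil
    true
  end
end
```

A document that is missing either timestamp keeps its error, so it still shows up in the next run and in the report.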

Introduces a job to handle the DTA SC creation failed error:

  • The job will search for all HigherLevelReviews stamped with the DTA SC creation failed error
    • For each HigherLevelReview, we search for the associated SupplementalClaim
    • The SupplementalClaim should have a decision_review_remanded_id field that matches the ID of the HigherLevelReview
    • Additionally, the SupplementalClaim will have decision_review_remanded_type equal to "HigherLevelReview"
    • If this SupplementalClaim exists then it has already successfully been created, and we call clear_error! on the HigherLevelReview
  • The DTA SC Creation Failed error occurs on a HigherLevelReview but indicates that a descendant SupplementalClaim was not created when it should have been. It appears that this error is largely, if not entirely, a self-healing issue
    • At one point in the process, the creation of the SupplementalClaim did fail; however, Caseflow retries several times before giving up. It appears that these subsequent attempts have a high success rate
  • Resolves Deploy and Validate in production - DTA SC Creation Failed
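
The lookup described above could look roughly like the following. It is a sketch under stated assumptions: `ReviewSketch` and `SupplementalClaimSketch` are hypothetical in-memory stand-ins for the models, and the substring used to find the stuck reviews is assumed. The field names and `clear_error!` come from the description.

```ruby
# Hypothetical stand-ins for the HigherLevelReview and SupplementalClaim models.
ReviewSketch = Struct.new(:id, :error, keyword_init: true) do
  def clear_error!
    self.error = nil
  end
end
SupplementalClaimSketch = Struct.new(
  :decision_review_remanded_id, :decision_review_remanded_type, keyword_init: true
)

# Clears the error on each review that already has its remand SupplementalClaim.
def remediate_dta_sc_errors(reviews, supplemental_claims)
  reviews.each do |hlr|
    # Assumption: stuck reviews are identified by this substring.
    next unless hlr.error.to_s.include?("DTA SC creation failed")

    already_created = supplemental_claims.any? do |sc|
      sc.decision_review_remanded_id == hlr.id &&
        sc.decision_review_remanded_type == "HigherLevelReview"
    end
    hlr.clear_error! if already_created
  end
end
```

Checking the type as well as the ID matters because decision_review_remanded_id is polymorphic: a record of a different type could share the same numeric ID.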

Introduces a job to handle the Claim not established error:

  • The job will search for all decision documents stamped with the Claim not established error.
    • To clean up these records, we gather the affected Decision Documents, locate the related End Product Establishment for each by way of the veteran_file_number, and check that its established_at field is populated and that it carries one of the codes in the EPECODES array
    • If both of these conditions are met, clear the error message
  • When a grant has been made on a veteran's appeal, an End Product Establishment is created in Caseflow and sent to VBMS to be established. This occasionally fails and the Claim not established error is saved to the corresponding Decision Document's error column
    • Under normal circumstances, Caseflow is prepared for this failure and will automatically retry to establish in VBMS
    • However, after a successful retry the error field of the Decision Document is never updated and the message remains
  • Resolves Validate in production for claim not established
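
As a rough illustration of the two conditions above, the sketch below uses hypothetical structs in place of the real models. The entries in `EPE_CODES_SKETCH` are placeholders, not real end product codes, and the `code` attribute name is an assumption.

```ruby
# Placeholder values only; the real list of end product codes lives in the job.
EPE_CODES_SKETCH = %w[CODE_A CODE_B].freeze

# Hypothetical stand-ins for the Decision Document and End Product Establishment.
GrantDocSketch = Struct.new(:veteran_file_number, :error, keyword_init: true)
EpeSketch = Struct.new(:veteran_file_number, :established_at, :code, keyword_init: true)

# True when a matching End Product Establishment is established and has a listed code.
def claim_established?(doc, epes)
  epes.any? do |epe|
    epe.veteran_file_number == doc.veteran_file_number &&
      !epe.established_at.nil? &&
      EPE_CODES_SKETCH.include?(epe.code)
  end
end

# Clears the error message only when both conditions are met.
def clear_claim_not_established(doc, epes)
  doc.error = nil if claim_established?(doc, epes)
  doc
end
```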

Introduces a job to handle the BGS::ShareError error:

The BGS::ShareError occurs on three different classes:

  • HigherLevelReview
  • RequestIssuesUpdate
  • BoardGrantEffectuation

The initial error is caused by a failure to pull information from VBMS. When Caseflow reruns the fetch and eventually succeeds, the EndProductEstablishment's established_at column records the date and time of establishment, which proves that the record succeeded. The errors on these three records, however, remain.
Completing a data clean-up for this error is quite simple.
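
One way such a clean-up might be written is shown below. The description does not spell out the exact steps, so this is an assumption: a record's error is cleared when any related EndProductEstablishment has established_at populated. `StuckRecordSketch`, `EstablishmentSketch`, and the `end_product_establishments` attribute are hypothetical names.

```ruby
# Hypothetical stand-ins; how each class reaches its establishments is assumed.
EstablishmentSketch = Struct.new(:established_at, keyword_init: true)
StuckRecordSketch = Struct.new(:record_type, :error, :end_product_establishments, keyword_init: true) do
  def clear_error!
    self.error = nil
  end
end

# Clears BGS::ShareError on records whose establishment has since succeeded.
def clear_share_errors(records)
  records.each do |record|
    next unless record.error.to_s.include?("BGS::ShareError")

    established = record.end_product_establishments.any? { |epe| !epe.established_at.nil? }
    record.clear_error! if established
  end
end
```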


Introduces a job to handle the Can't create a SC DTA for appeal error:

This job clears the Can't create a SC DTA for appeal error on the Decision Document.
The process is as follows.

The logic checks whether the payee_code on the claimant related to the decision document is nil. If it is, it then checks the claimant type. If the type is "veteran", we set the payee_code to 00. If the claimant type is DependentClaimant, we set the payee_code to 10.

We then clear the error on the decision document object.
Logs are also written to S3.
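
The payee code logic can be sketched as follows. The structs are hypothetical stand-ins, "VeteranClaimant" is an assumed class name for the "veteran" type, and leaving a document untouched when the payee_code is already set or the claimant type is unrecognized is also an assumption.

```ruby
# Assumption: "VeteranClaimant" is the type string for a veteran claimant.
PAYEE_CODES_SKETCH = { "VeteranClaimant" => "00", "DependentClaimant" => "10" }.freeze

# Hypothetical stand-ins for the claimant and the decision document.
PayeeClaimantSketch = Struct.new(:type, :payee_code, keyword_init: true)
RemandDocSketch = Struct.new(:error, :claimant, keyword_init: true)

# Fills in a missing payee_code, then clears the document error.
# Returns true when the document was remediated.
def remediate_payee_code(doc)
  claimant = doc.claimant
  return false unless claimant.payee_code.nil?

  code = PAYEE_CODES_SKETCH[claimant.type]
  return false if code.nil?

  claimant.payee_code = code
  doc.error = nil
  true
end
```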


Introduces a service to handle logging for the remediations above:

  • The StuckJobReportService generates logs for the previously listed remediations
    • Displays the total count of erroneous records both before and after the remediation is run
    • Documents the individual records as they are processed
    • Notes any record that failed to remediate
    • Sends the report to the appropriate S3 Bucket
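
A reporting helper with these responsibilities might be shaped like the sketch below. It is not the StuckJobReportService itself: the class name, method names, and the injected `uploader` callable (standing in for the S3 write) are all hypothetical.

```ruby
# Hypothetical sketch of a report collector; not the real StuckJobReportService.
class StuckJobReportSketch
  attr_reader :lines

  # uploader: any callable taking (key, body); stands in for the S3 write.
  def initialize(uploader:)
    @uploader = uploader
    @lines = []
  end

  # Records a total, for example before and after the remediation runs.
  def record_count(label, count)
    @lines << "#{label}: #{count}"
  end

  # Documents an individual record as it is processed.
  def record_processed(record_id)
    @lines << "Processed record #{record_id}"
  end

  # Notes a record that failed to remediate.
  def record_failure(record_id, reason)
    @lines << "FAILED record #{record_id}: #{reason}"
  end

  # Sends the accumulated report to storage under the given key.
  def publish(key)
    @uploader.call(key, @lines.join("\n"))
  end
end
```

Injecting the uploader keeps the sketch runnable without AWS credentials; a remediation job would pass in whatever actually writes to the bucket.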

Augments the naming convention for all S3 Buckets in Caseflow

  • Refactored all sub-buckets to route to the S3 bucket set in the environment variable
  • All data written to S3 in production, prod test, and UAT will go to their respective buckets and be organized in appropriately named folders
  • Resolves Migrate S3 updates to Prod and smoke test
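
The routing idea can be illustrated with a small helper. This is a sketch only: the environment variable name `S3_BUCKET_NAME` and the folder and file names are hypothetical, since the description does not name them.

```ruby
# Resolves the bucket from the environment and builds a folder-scoped object key.
# Assumption: the bucket name is stored in a variable called S3_BUCKET_NAME.
def s3_location(folder, filename, env: ENV)
  bucket = env.fetch("S3_BUCKET_NAME") # raises KeyError when the variable is unset
  [bucket, "#{folder}/#{filename}"]
end
```

Because the bucket comes from the environment, the same call writes to the production, prod test, or UAT bucket depending on where it runs, and only the folder name is chosen in code.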

Acceptance Criteria

  • Code compiles correctly

Testing Plan

  1. Go to Jira Issue/Test Plan Link or list them below
  • For feature branches merging into master: Was this deployed to UAT?

Frontend

User Facing Changes

  • Screenshots of UI changes added to PR & Original Issue
BEFORE AFTER

Storybook Story

For Frontend (Presentation) Components

  • Add a Storybook file alongside the component file (e.g. create MyComponent.stories.js alongside MyComponent.jsx)
  • Give it a title that reflects the component's location within the overall Caseflow hierarchy
  • Write a separate story (within the same file) for each discrete variation of the component

Backend

Database Changes

Only for Schema Changes

  • Add typical timestamps (created_at, updated_at) for new tables
  • Update column comments; include a "PII" prefix to indicate definite or potential PII data content
  • Have your migration classes inherit from Caseflow::Migration, especially when adding indexes (use add_safe_index) (see Writing DB migrations)
  • Verify that migrate:rollback works as desired (change supported functions)
  • Perform query profiling (eyeball Rails log, check bullet and fasterer output)
  • For queries using raw SQL, was an explain plan run by System Team?
  • Add appropriate indexes (especially for foreign keys, polymorphic columns, unique constraints, and Rails scopes)
  • Run make check-fks; add any missing foreign keys or add to config/initializers/immigrant.rb (see Record associations and Foreign Keys)
  • Add belongs_to for associations to enable the schema diagrams to be automatically updated
  • Document any non-obvious semantics or logic useful for interpreting database data at Caseflow Data Model and Dictionary

Integrations: Adding endpoints for external APIs

  • Check that Caseflow's external API code for the endpoint matches the code in the relevant integration repo
    • Request: Service name, method name, input field names
    • Response: Check expected data structure
    • Check that calls are wrapped in MetricService record block
  • Check that all configuration is coming from ENV variables
    • Listed all new ENV variables in description
    • Worked with or notified System Team that new ENV variables need to be set
  • Update Fakes
  • For feature branches: Was this tested in Caseflow UAT?

Best practices

Code Documentation Updates

  • Add or update code comments at the top of the class, module, and/or component.

Tests

Test Coverage

Did you include any test coverage for your code? Check below:

  • RSpec
  • Jest
  • Other

Code Climate

Does your code add any new Code Climate offenses? If so, why?

  • No new code climate issues added

Monitoring, Logging, Auditing, Error, and Exception Handling Checklist

Monitoring

  • Are performance metrics (e.g., response time, throughput) being tracked?
  • Are key application components monitored (e.g., database, cache, queues)?
  • Is there a system in place for setting up alerts based on performance thresholds?

Logging

  • Are logs being produced at appropriate log levels (debug, info, warn, error, fatal)?
  • Are logs structured (e.g., using log tags) for easier querying and analysis?
  • Are sensitive data (e.g., passwords, tokens) redacted or omitted from logs?
  • Is log retention and rotation configured correctly?
  • Are logs being forwarded to a centralized logging system if needed?

Auditing

  • Are user actions being logged for audit purposes?
  • Are changes to critical data being tracked?
  • Are logs being securely stored and protected from tampering or exposing protected data?

Error Handling

  • Are errors being caught and handled gracefully?
  • Are appropriate error messages being displayed to users?
  • Are critical errors being reported to an error tracking system (e.g., Sentry, ELK)?
  • Are unhandled exceptions being caught at the application level?

Exception Handling

  • Are custom exceptions defined and used where appropriate?
  • Is exception handling consistent throughout the codebase?
  • Are exceptions logged with relevant context and stack trace information?
  • Are exceptions being grouped and categorized for easier analysis and resolution?

codeclimate bot commented Sep 20, 2023

Code Climate has analyzed commit c8539de and detected 0 issues on this pull request.

View more on Code Climate.

AdamShawBAH and others added 6 commits September 22, 2023 10:43
Co-authored-by: Griffin Dooley <gcd253@users.noreply.github.com>
Co-authored-by: Griffin Dooley <gcd253@users.noreply.github.com>
@AdamShawBAH AdamShawBAH changed the title UAT Stuck Jobs Reporter and Fixes and S3 Buckets updates Stuck Jobs Reporter and Fixes and S3 Buckets updates Sep 22, 2023
@nkutub nkutub merged commit 7d41b9e into master Sep 27, 2023
14 of 15 checks passed
@ThorntonMatthew ThorntonMatthew deleted the Shaw/s3-stuck-jobs-fixes-report-service-prod branch September 26, 2024 16:06