O+M 2023-04-21 #4281

Closed
10 tasks
nickumia-reisys opened this issue Apr 14, 2023 · 4 comments

@nickumia-reisys
Contributor

As part of day-to-day operation of Data.gov, there are many Operation and Maintenance (O&M) responsibilities. Rather than have the entire team watch notifications and risk some slipping through the cracks, we have created an O&M Triage role. One person on the team is assigned the Triage role, which rotates each sprint. This is not meant to be a 24/7 responsibility; it covers East Coast business hours only. If you are unavailable, please note in Slack when you will be out and ask someone to take on the role for that time.

Miscs

Acceptance criteria

You are responsible for all O&M responsibilities this week. We've highlighted a few so they're not forgotten. You can copy each checklist into your daily report.

Daily Checklist

Check Production State/Actions

Note: Catalog Auto Tasks
You will need to update the chart values manually. Click the Action link in each issue, grab the values from the monitor task output, and check the runtime.

Weekly Checklist

@robert-bryson
Contributor

Update from yesterday:

  • From Saturday 2023-04-15

    • Catalog Prod high error rate - self resolved after 116 min
      • Logs missing for duration, perhaps actual outage
    • Catalog Prod high error rate - self resolved after 24 min
      • Appears to have kept serving traffic, albeit with up to 9% errors
  • From Sunday 2023-04-16

  • Catalog Prod high error rate - self resolved after 146 min

  • Updated O&M dashboard errors graph to show last 30 mins to match others

  • 4 critical issues with GSA/datagov-brokerpak-solr on the Snyk dashboard
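
To make the alert durations above easier to compare, here is a minimal sketch that pulls the "self resolved after N min" figures out of the alert lines and totals them. The alert text is copied from this update; the function names are made up for illustration and are not part of any Data.gov tooling.

```python
import re

# Alert lines as quoted in the update above.
ALERTS = [
    "Catalog Prod high error rate - self resolved after 116 min",
    "Catalog Prod high error rate - self resolved after 24 min",
    "Catalog Prod high error rate - self resolved after 146 min",
]

DURATION = re.compile(r"self resolved after (\d+) min")


def alert_minutes(lines):
    """Return the self-resolve duration, in minutes, for each matching line."""
    return [int(m.group(1)) for line in lines if (m := DURATION.search(line))]


def format_duration(minutes):
    """Render a minute count as 'Hh MMm' for easier reading."""
    hours, rest = divmod(minutes, 60)
    return f"{hours}h {rest:02d}m"


durations = alert_minutes(ALERTS)
print(format_duration(sum(durations)), "total;",
      format_duration(max(durations)), "longest")
```

For the three alerts listed, this totals 286 minutes (4h 46m), with the longest single alert at 2h 26m.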

@robert-bryson
Contributor

Update from yesterday's tasks:

  • Dedupe 500 error from Nick on Slack
  • Harvest emails
    • Possible issue with BLS: HTTPError getting json source: 403 Client Error: Forbidden for url: [https://www.bls.gov/data.json](https://www.bls.gov/data.json).
  • Alerts:
    • Catalog Prod high error rate - self resolved after 472 min
      • Over 7 hrs of possible downtime? Logs missing in NR from ~4pm to midnight.
    • Solr email: Cloudwatch solr memory alert
    • Catalog Prod high error rate - self resolved after 49 min
    • Catalog Prod high error rate - self resolved after 41 min

@robert-bryson
Contributor

Update from yesterday:

  • Snyk bump sass in resources.data.gov
  • Snyk requirements bump in catalog.data.gov
  • Harvest emails look about the same as yesterday:
    • 403 issue with BLS, but I believe that Fuhu implemented a fix
    • 24 ed-gov schema errors
    • 20 DOT socrata schema errors
    • 78 DOT NTAD schema errors
    • 20 healthdata.gov schema errors
  • Alerts:
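
Since the harvest emails tend to repeat the same few failure types, a rough sketch of bucketing them by message text might help with the daily tally. The patterns below are drawn from error messages quoted in this thread; the category names and the `classify` helper are assumptions for illustration, not an existing Data.gov taxonomy or tool.

```python
import re
from collections import Counter

# Ordered (category, pattern) pairs; the first match wins.
# Category names are illustrative only.
RULES = [
    ("tls_certificate", re.compile(r"SSLError|CertificateError")),
    ("forbidden", re.compile(r"\b403 Client Error")),
    ("not_found", re.compile(r"\b404 Client Error")),
    ("transform", re.compile(r"Transformation to ISO failed")),
]


def classify(message):
    """Return the first matching category for a harvest error message."""
    for category, pattern in RULES:
        if pattern.search(message):
            return category
    return "other"


# Sample message copied from the BLS report earlier in this thread.
samples = [
    "HTTPError getting json source: 403 Client Error: Forbidden for url: "
    "https://www.bls.gov/data.json",
]
print(Counter(classify(m) for m in samples))
```

Schema errors are not covered here because the thread does not quote their message text; anything unmatched falls into `other`.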

@robert-bryson
Contributor

Notes from Thursday/Friday:

[[2023-04-20]]

  • Harvest error emails same as yesterday, but it does look like the BLS issue is resolved
  • Snyk PR to inventory for sqlparse.
  • Alerts:

[[2023-04-21]]

  • Inventory solr issue caused inventory outage

    • Fix implemented by Fuhu/Nick; new Solr service created as a longer-term fix
    • Fix to bump solr brokerpak to help restarts
  • Harvest emails:

    • Overall good
    • 403 for state-gov State JSON at https://www.state.gov/data.json
    • Error for ogr city-and-county-of-durham-north-carolina: ConnectionError getting json source: HTTPSConnectionPool(host='opendurham.nc.gov', port=443): Max retries exceeded with url: /data.json (Caused by SSLError(CertificateError("hostname 'opendurham.nc.gov' doesn't match either of '*.durhamnc.gov', 'durhamnc.gov'")))
    • Error for edac-unm-edu: Unable to get content for URL: http://rgismetadata.unm.edu/19115/collections/: HTTPError('404 Client Error: Not Found for url: http://rgismetadata.unm.edu/19115/collections/')
    • Error for city-of-baltimore: HTTPError getting json source: 404 Client Error: Not Found for url: https://data.baltimorecity.gov/data.json?version=2.
    • Error for city-of-santa-rosa: HTTPError getting json source: 404 Client Error: Not Found for url: https://data.srcity.org/data.json.
    • Error for NCDDC OER: Transformation to ISO failed
  • Duplicate checks:

         "USGS:62bb92fad34e88efc5245d4c": 2,
         "ark:/88434/mds2-2596": 2,
         "https://data.iowa.gov/api/views/aa2a-3p4g": 2,
         "https://data.iowa.gov/api/views/e7er-326q": 2,
         "https://doi.org/10.23719/1526064": 2
       "guid": {
         "03409563705445619de7adcc6e465e78": 2,
         "0beac6b7ee604d259e4d0d0d44c827a2": 2,
         "110323f5cd35451a8c38b7dcae068d84": 2,
         "20dd3dde7eab4f7aacfa6d515323ef46": 2,
         "2370cf694c824ff1ae8f767678b9cee3": 2,
         "27235c38742f4f75b5c1326fb4cf0a0b": 2,
         "2a42ed42c218482295e4fd9e5a871c50": 2,
         "2dea594adf3444f7b70d9c26cb684909": 2,
         "46503f0d58fd43d989d388e5f15e7604": 2,
         "4d131ab1b90b4de6ad64f3c9a939a019": 2,
         "5fac17ce12b3467a9a7ba65a9ed7cc19": 2,
         "6594c292c4a24f18a82beb41d1d9c4da": 2,
         "725ae3be307d455f9b0058fa4fb5f4f4": 2,
         "76dfb04a17834681af1389b2ddab15ce": 2,
         "ad21d343d6424c028da7c36ea719f028": 2,
         "c411884f6e5c4cffa13b0abd540bd312": 2,
         "d46a18beb67d40a3bf78c5a7c094a5a0": 2,
         "f968793ae1124378a424b58b52afe36e": 2
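
For reference, a report in the shape above (values mapped to how many times they occur) can be approximated with a short sketch like the following. This is not the actual duplicate-check job: the `duplicate_counts` helper, the `identifier` field name for the first group, and the sample records are assumptions; only the `guid` key and the sample identifier values appear in the output above.

```python
import json
from collections import Counter


def duplicate_counts(records, fields=("identifier", "guid")):
    """Map each field to the values that occur more than once, with counts."""
    report = {}
    for field in fields:
        counts = Counter(r[field] for r in records if r.get(field))
        report[field] = {v: n for v, n in sorted(counts.items()) if n > 1}
    return report


# Hypothetical input: two records share an identifier seen in the output above.
# The guid values are placeholders.
records = [
    {"identifier": "ark:/88434/mds2-2596", "guid": "a"},
    {"identifier": "ark:/88434/mds2-2596", "guid": "b"},
    {"identifier": "https://doi.org/10.23719/1526064", "guid": "c"},
]
print(json.dumps(duplicate_counts(records), indent=2))
```

Only values seen at least twice are kept, so a field with no duplicates produces an empty object.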
