O+M 2024-10-21 #4937

Open
14 tasks
hkdctol opened this issue Oct 18, 2024 · 3 comments

@hkdctol
Contributor

hkdctol commented Oct 18, 2024

Day-to-day operation of Data.gov involves many Operations and Maintenance (O&M) responsibilities. Rather than having the entire team watch notifications and risk some slipping through the cracks, we have created an O&M Triage role. One person on the team holds the Triage role, which rotates each sprint. This is not meant to be a 24/7 responsibility; it covers East Coast business hours only. If you will be unavailable, please note when in Slack and ask someone to cover the role for that time.

Check the O&M Rotation Schedule for future planning.

Acceptance criteria

You are responsible for all O&M responsibilities this week. We've highlighted a few so they're not forgotten. You can copy each checklist into your daily report.

Daily Checklist

Note: Catalog Auto Tasks
You will need to update the chart values manually. Click the Action link in each issue, take the values from the monitor task output, and check the runtime.

Weekly Checklist

Monthly Checklist

Ad-hoc Checklist

  • Audit/review applications on Cloud Foundry and determine what can be stopped and/or deleted.

Reference

@FuhuXia
Member

FuhuXia commented Oct 21, 2024

@hkdctol
The weekly harvest source /harvest/fdic-data-json has been failing for the past 10 weeks; the source URL https://www.fdic.gov/data.json is returning a 404 error. We might need to notify the FDIC point of contact (POC).
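For anyone who wants to confirm a source URL's status before contacting the POC, here is a minimal sketch using only the Python standard library. The function name `source_status` and the injectable `opener` parameter are my own illustrative choices, not part of the harvester. It sends a HEAD request, which some servers reject, so a GET may be needed in practice.

```python
import urllib.error
import urllib.request


def source_status(url, opener=urllib.request.urlopen, timeout=10):
    """Return the HTTP status code for url, or None if the host is unreachable.

    `opener` is injectable (hypothetical parameter) so the function can be
    exercised without network access.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses raise HTTPError; the code is what we want to report.
        return err.code
    except urllib.error.URLError:
        # DNS failure, refused connection, timeout, etc.
        return None
```

Calling `source_status("https://www.fdic.gov/data.json")` would return the status code the server reports at the time of the call.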

@FuhuXia
Member

FuhuXia commented Oct 23, 2024

The Solr leader was restarted last night, causing harvesting issues. I manually ran db-solr-sync and tracking-update to ensure all datasets are properly indexed.

A Solr leader restart has no impact on public users. It only affects catalog-admin users and some backend activity.

Looking at the logs for the past three months, the Solr leader restarted 8 times. Comparing the timing to catalog-admin activity, the restarts appear loosely correlated with spikes of API requests to /harvest/object/*, most of which originate from an automated geoplatform.com script. Ideally, we would coordinate with the geoplatform.com team and give them a time window in which a Solr leader restart has minimal impact on our system. I am thinking of a 6pm ET start. Their script can run for up to 5 hours, so it should finish by 11pm ET, a window in which we don't expect any activity from admin users or backend harvesting.
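The kind of comparison described above can be sketched roughly as follows: for each restart, count how many requests arrived in the window leading up to it. This assumes the restart times and /harvest/object/* request times have already been pulled from the logs into lists of `datetime` values. The function name and the one-hour default window are illustrative assumptions, and the timestamps in the usage lines are made-up placeholders, not values from our logs.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime, timedelta


def requests_before_restarts(restarts, request_times, window=timedelta(hours=1)):
    """Map each restart time to the number of requests in the preceding window."""
    ordered = sorted(request_times)
    counts = {}
    for restart in restarts:
        # Requests with restart - window <= timestamp <= restart.
        lo = bisect_left(ordered, restart - window)
        hi = bisect_right(ordered, restart)
        counts[restart] = hi - lo
    return counts


# Placeholder timestamps for illustration only.
example_restart = datetime(2024, 1, 1, 2, 0)
example_requests = [
    datetime(2024, 1, 1, 0, 30),
    datetime(2024, 1, 1, 1, 10),
    datetime(2024, 1, 1, 1, 30),
    datetime(2024, 1, 1, 1, 59),
]
example_counts = requests_before_restarts([example_restart], example_requests)
```

A count that is consistently high before restarts and low at other times would support the correlation; it would not by itself show that the requests cause the restarts.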

image

image

@btylerburton
Contributor

This is good research here, Fuhu. I like seeing the clear correlation. I don't like that we need to coordinate this; the system should be able to throttle itself.
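One common way for a service to throttle itself is a token bucket. The sketch below is a generic illustration of that technique, not a description of how catalog or Solr currently handles load; the class name, parameters, and injectable `clock` are all hypothetical.

```python
import time


class TokenBucket:
    """Allow up to `capacity` requests in a burst, refilling at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock  # injectable so behavior can be checked without sleeping
        self.tokens = capacity
        self.updated = clock()

    def allow(self):
        """Return True and consume a token if one is available, else False."""
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A caller would check `bucket.allow()` before serving each request and respond with a retry-later status when it returns False.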
