Troubleshoot StatusCake alert
When to use
When StatusCake raises an alert about a library site being down.
Prerequisites
Procedure
- Determine whether the site is accessible by requesting the /health endpoint for the site
- Note the response
- Log into StatusCake
- Locate the uptime test for the library site in question
- Find the downtime root cause for the alert based on the time of the alert
- Click Extra details
- Note the error
- Log into Grafana
- Click Explore
- Add a label filter with the label namespace and the name of the affected site as the value
- Add a label filter with the label app and the value php
- Set the time range to include the start of the alert
- Note any errors
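The /health check and the Grafana query from the procedure can be scripted. The following is a minimal sketch that uses only the Python standard library. It assumes that the Grafana logs datasource is Loki, in which case the two label filters correspond to a LogQL stream selector. The helper names health_status and grafana_log_selector are hypothetical, and so is the convention of returning 0 when no HTTP response arrives.

```python
import urllib.error
import urllib.request


def health_status(base_url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code of <base_url>/health, or 0 if there is no response.

    Hypothetical helper for illustration.
    """
    url = base_url.rstrip("/") + "/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # urlopen raises on 4xx/5xx responses; the status code is what we want.
        return exc.code
    except OSError:
        # No HTTP response at all: DNS failure, refused connection or timeout.
        return 0


def grafana_log_selector(site: str, app: str = "php") -> str:
    """Build a LogQL stream selector matching the Explore label filters.

    Assumes a Loki datasource; site is the namespace of the affected site.
    """
    return '{namespace="%s", app="%s"}' % (site, app)
```

For example, health_status("https://<site>") returns 500 for a site showing the errors described under Known problems. grafana_log_selector("<site namespace>") produces a query that can be pasted into Explore when a Loki datasource is in use.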
Known problems
This section lists known problems and how to address them.
DNS lookup errors
Observations
- StatusCake reports Request timeout with EAI_AGAIN in the additional data.
Action
You can ignore this error. EAI_AGAIN indicates a temporary failure in name resolution, which means StatusCake is experiencing DNS lookup issues. These are outside the scope of the platform.
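To check that the site's hostname resolves from your own machine, you can run a local lookup. The sketch below uses Python's standard socket.getaddrinfo. The helper name check_dns is hypothetical, and socket.EAI_AGAIN is assumed to be available, which is the case on Unix-like systems.

```python
import socket


def check_dns(hostname: str) -> str:
    """Resolve hostname locally and classify the result.

    Returns "ok" if the name resolves, "temporary" for EAI_AGAIN
    (temporary failure in name resolution) and "failed" for any other
    resolver error. Hypothetical helper for illustration.
    """
    try:
        socket.getaddrinfo(hostname, 443)
    except socket.gaierror as exc:
        # Same getaddrinfo error code that appears as EAI_AGAIN in StatusCake.
        if exc.errno == socket.EAI_AGAIN:
            return "temporary"
        return "failed"
    return "ok"
```

Suppose check_dns returns "ok" for the site's hostname while StatusCake reports EAI_AGAIN. That supports the conclusion that the lookup problem is on StatusCake's side.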
Corrupt Drupal cache
Observations
- /health reports HTTP status code 500
- Grafana logs contain PHP exceptions
Action
- Log into Lagoon UI
- Locate the project in question
- Locate the environment in question
- Locate Tasks
- Run the task Clear Drupal caches for the environment in question
- Wait for the task to finish
- Verify that /health reports HTTP status code 200
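A cache rebuild can take a moment after the task finishes, so the verification step may need a few attempts. The following is a minimal polling sketch. The helper name wait_for_healthy and the default attempt count and delay are assumptions, not platform settings. The status check is passed in as a callable so that any way of requesting /health can be used.

```python
import time
from typing import Callable


def wait_for_healthy(
    check: Callable[[], int],
    attempts: int = 10,
    delay_seconds: float = 30.0,
) -> bool:
    """Call check() until it returns HTTP 200 or the attempts run out.

    check is any callable returning the current /health status code.
    Hypothetical helper; the defaults are illustrative only.
    """
    for attempt in range(attempts):
        if check() == 200:
            return True
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    return False


if __name__ == "__main__":
    # Simulated responses: two failures while caches rebuild, then success.
    responses = iter([500, 500, 200])
    print(wait_for_healthy(lambda: next(responses), attempts=5, delay_seconds=0))  # prints True
```

If the function returns False, /health is still not healthy after the cache clear. The problem is then probably not a corrupt cache, and the Grafana logs should be checked again.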
Database connection errors
Observations
- /health reports HTTP status code 500
- Grafana logs contain PDOException: SQLSTATE[HY000]
Action
Do nothing. These errors are likely caused by a restart of the underlying database, which experience shows takes about 20 minutes to complete.
Note that such errors will affect all sites running on the platform and will result in multiple alerts being raised.
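Both known problems that make /health return 500 can be told apart by their logs. The following Python sketch classifies log lines copied from Grafana and computes how much of the roughly 20-minute restart window remains. The helper names and the simple substring matching are assumptions for illustration, not part of the platform tooling.

```python
from datetime import datetime, timedelta
from typing import Sequence

# Log marker for the database connection errors described above.
DB_ERROR_MARKER = "PDOException: SQLSTATE[HY000]"
# Approximate restart duration, based on the experience noted above.
DB_RESTART_DURATION = timedelta(minutes=20)


def classify_500(log_lines: Sequence[str]) -> str:
    """Map PHP log lines from a site whose /health returns 500 to a known problem.

    Heuristic substring matching; hypothetical helper for illustration.
    """
    # Check the database marker first: PDOException is itself a PHP
    # exception, so the generic check below would also match it.
    if any(DB_ERROR_MARKER in line for line in log_lines):
        return "Database connection errors"
    if any("Exception" in line for line in log_lines):
        return "Corrupt Drupal cache"
    return "Unknown problem"


def remaining_db_restart_wait(alert_start: datetime, now: datetime) -> timedelta:
    """Return the time left of the approximate restart window, never negative."""
    elapsed = now - alert_start
    return max(DB_RESTART_DURATION - elapsed, timedelta(0))
```

If database errors are still logged after remaining_db_restart_wait reaches zero, the cause is probably not a routine restart.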