Context
Grype needs up-to-date vulnerability data to scan accurately. For this reason, every Grype invocation checks whether a new Grype database is available and, if one is, downloads it. The check is performed by downloading a file called listing.json and comparing the latest listed database with the current database. If the current database is older than the newest database in the listing file, the URL in the listing file is used to download a new database. The listing.json file and the database files are hosted in S3 and cached by a CloudFlare Business-tier CDN.
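To make this flow concrete, here is a simplified sketch of the check. The type and field names are illustrative (based on the description above, not Grype's actual code or the exact listing.json schema):

```go
package updatecheck

import "time"

// ListingEntry describes one published database in listing.json
// (illustrative fields; the real schema may differ).
type ListingEntry struct {
	Built time.Time `json:"built"`
	URL   string    `json:"url"`
}

// Listing is the parsed listing.json, with entries grouped by schema version.
type Listing struct {
	Available map[int][]ListingEntry `json:"available"`
}

// newerDatabaseAvailable returns the newest listed database for the given
// schema version, and whether it is strictly newer than the local database.
func newerDatabaseAvailable(listing Listing, schemaVersion int, currentBuilt time.Time) (ListingEntry, bool) {
	var newest ListingEntry
	for _, entry := range listing.Available[schemaVersion] {
		if entry.Built.After(newest.Built) {
			newest = entry
		}
	}
	return newest, newest.Built.After(currentBuilt)
}
```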
When Grype begins execution, it parses (or generates) the SBOM in parallel with the database update, but the scan cannot begin until both are complete. Therefore, if Grype’s download of the listing file and/or the database takes longer than parsing/generating the SBOM, every second of this latency is experienced by end users as increased execution time for Grype.
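The structure of this step looks roughly like the following sketch (a simplified illustration of the concurrency described above, not Grype's actual code; the function parameters are placeholders):

```go
package scanflow

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// runScan sketches the structure described above: the SBOM is produced in
// parallel with the database update, and the scan starts only after both
// finish, so any extra DB latency beyond SBOM generation time is felt
// directly by the user.
func runScan(ctx context.Context, generateSBOM, updateDatabase func(context.Context) error, scan func() error) error {
	g, ctx := errgroup.WithContext(ctx)
	g.Go(func() error { return generateSBOM(ctx) })   // parse or generate the SBOM
	g.Go(func() error { return updateDatabase(ctx) }) // check listing.json, download a new DB if needed
	if err := g.Wait(); err != nil {
		return err
	}
	return scan()
}
```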
We have a number of customers complaining about slowness in Grype, and have isolated the slowness to slow response times from the CDN. These slow response times affect both downloading the new Grype database and checking whether a new Grype database is available. (issue, issue, issue)
We made a previous attempt to fix this issue (pr) by setting timeouts on Grype’s calls to both check whether an update is available (default 30s) and download the new DB (default 120s). This changes the customer experience when the CDN is slow from a hang (the previously used client had no timeout) to a fault after 30 or 120 seconds, depending on which step is slow. However, this is an incomplete fix, since it still leaves two bad customer experiences: First, customers on slow connections will find that the default timeouts are too low, and will have a normal, progressing download interrupted by the timeout, causing Grype to fail to execute. Second, customers who encounter the CDN fault will still see Grype hang for some time before inevitably failing.
Gathering more data
Because users are the ones bringing this issue to our attention, we should fix our monitoring. Anchore team members will work to set up DataDog synthetic monitoring to get data on how often this happens, in what regions, etc, and see if this information gives us more options or more insight.
Options
Detecting the failure mode and retrying
The failure mode that Grype DB experiences is interesting: The CDN accepts the connection, sends response headers, and starts sending the response body, but then the rest of the response body never arrives or slows to a crawl. Users have reported 30s+ times to download ~150KB JSON documents, and have captured screenshots of progress bars effectively paused mid-way through the download.
It would be ideal to distinguish between this failure mode, which I’ll call a “stalled connection,” and a working but slow connection which is annoying but will eventually succeed. It would be helpful if Grype could immediately kill and retry a stalled connection, but let a connection that is still reading, however slowly, keep trying.
We can do this by wrapping the response body reader in a reader that repeatedly calls conn.SetReadDeadline on the underlying connection during its Read method, essentially saying, “if X seconds pass without any new bytes to read, fail the connection.” The hope is that an X can be found such that a slow connection always sends new bytes before X seconds have elapsed, while a stalled connection is killed and retried after only X seconds.
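A minimal sketch of this idea, assuming we construct the HTTP client ourselves. Since the response body reader does not expose the underlying connection directly, this sketch applies the same deadline-refresh trick at the point where the connection is dialed; type and function names are illustrative:

```go
package stalldetect

import (
	"context"
	"net"
	"net/http"
	"time"
)

// stallConn refreshes the read deadline before every Read, so the connection
// fails only if no bytes arrive for a full stallTimeout window (a stalled
// connection), while a slow but progressing download keeps extending its own
// deadline.
type stallConn struct {
	net.Conn
	stallTimeout time.Duration
}

func (c *stallConn) Read(p []byte) (int, error) {
	if err := c.Conn.SetReadDeadline(time.Now().Add(c.stallTimeout)); err != nil {
		return 0, err
	}
	return c.Conn.Read(p)
}

// newStallDetectingClient wraps every dialed connection in a stallConn.
// A request whose body stalls then fails with a deadline error, which the
// caller can treat as a signal to kill the connection and retry.
func newStallDetectingClient(stallTimeout time.Duration) *http.Client {
	dialer := &net.Dialer{Timeout: 30 * time.Second}
	return &http.Client{
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
				conn, err := dialer.DialContext(ctx, network, addr)
				if err != nil {
					return nil, err
				}
				return &stallConn{Conn: conn, stallTimeout: stallTimeout}, nil
			},
		},
	}
}
```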
It’s also worth noting that retries increase the load on our CDN, and our CDN being overloaded is part of the cause of the issue. In general, retries increase steady-state reliability of a system, but amplify the effects of congestion and outages.
Just retrying (not recommended)
We could just retry using Go’s regular HTTP Timeout setting. However, this would cause slow connections that would eventually have succeeded to be repeatedly aborted and retried, and stalled connections to wait out the full timeout before retrying. In other words, for every choice of N, N seconds is too soon to retry a slow connection and too late to retry a stalled connection.
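For comparison, this approach amounts to something like the following sketch (the timeout, retry count, and function name are arbitrary illustrations, not a proposed implementation):

```go
package retryfetch

import (
	"fmt"
	"net/http"
	"time"
)

// fetchWithRetry shows the "just retry" approach: a fixed overall timeout on
// the whole request plus a retry loop. A genuinely slow download that needs
// longer than the timeout can never succeed, while a stalled connection still
// burns the full timeout before each retry.
func fetchWithRetry(url string, attempts int, timeout time.Duration) (*http.Response, error) {
	client := &http.Client{Timeout: timeout}
	var lastErr error
	for i := 0; i < attempts; i++ {
		resp, err := client.Get(url)
		if err == nil {
			return resp, nil
		}
		lastErr = err
	}
	return nil, fmt.Errorf("all %d attempts failed: %w", attempts, lastErr)
}
```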
Add a ‘min DB age to check’ field
In this option, we change Grype so that, if the current database is less than N hours old, it doesn’t check for an update at all. This will reduce load on the CDN. A new database is published roughly every 24 hours, so, for example, if we set N to 4 hours, we expect approximately ⅙ (4/24) of requests to be skipped.
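A sketch of the gate, with an illustrative function name and a hypothetical default value:

```go
package updatepolicy

import "time"

// defaultMinDBAgeToCheck is a hypothetical default for the proposed setting.
const defaultMinDBAgeToCheck = 4 * time.Hour

// shouldCheckForUpdate skips the listing.json request entirely while the
// local database is younger than minAge.
// e.g. shouldCheckForUpdate(currentDB.Built, defaultMinDBAgeToCheck)
func shouldCheckForUpdate(dbBuiltAt time.Time, minAge time.Duration) bool {
	return time.Since(dbBuiltAt) >= minAge
}
```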
However, this does not even out the load; it only reduces it at the beginning of each database’s lifetime. For example, since there are a large number of Grype invocations around the clock, if we change the default configuration to wait 4 hours after downloading one Grype DB before checking for another, we expect a large reduction in traffic in the first 4 hours after we release a new database, followed by a resumption of full traffic once the newest database is 4 hours old. We wouldn’t expect much reduction in requests per second (RPS) after that point.
Gradually roll out new Grype DBs
Perhaps some CDN configuration could be used to spread out the rollout of the new Grype DB. For example, every geographic region could get a new Grype DB at midnight local time. If this were combined with the “min DB age to check” option above, we could achieve, for example, a ⅙ reduction in total load by configuring a 4-hour min DB age to check.
Enable Mirrors
Right now, the listing file and all the databases are served from CloudFlare at the same hostname. This generates 10-12M requests / day on CloudFlare. At least one user (issue) has requested the ability to mirror this data. Whether this would help depends a great deal on whether people are willing to host mirrors. It’s worth noting that this would add a different mechanism for offline/air-gapped Grype, although that is already supported.
Explore caching in the scan-action
We should research whether scan-action, which uses Grype internally, could use the GitHub Actions cache in order to hit the CDN less often.
Decision
We’ll see! Still gathering info right now.