Grype DB network and CDN issues

Context

Grype needs up-to-date vulnerability data to scan accurately. For this reason, every Grype invocation checks whether a new Grype database is available and, if one is, downloads it. The check is performed by downloading a file called listing.json and comparing the latest listed database with the current database. If the current database is older than the newest database in the listing file, the URL in the listing file is used to download a new database. The listing.json and database files are hosted in S3 and cached by a CloudFlare Business-tier CDN.
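
As a rough sketch of that flow (simplified types and field names, not Grype’s actual implementation), the check amounts to: fetch listing.json, find the newest entry for the relevant schema version, and compare its build time to the local database’s build time.

```go
package dblisting

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// ListingEntry is a simplified stand-in for one database entry in listing.json;
// the real document has more fields (version, checksum, etc.).
type ListingEntry struct {
	Built time.Time `json:"built"`
	URL   string    `json:"url"`
}

// Listing is a simplified stand-in for listing.json, keyed by schema version.
type Listing struct {
	Available map[string][]ListingEntry `json:"available"`
}

// needsUpdate downloads the listing and reports whether the newest entry for
// the given schema is newer than the locally installed database.
func needsUpdate(listingURL, schema string, currentBuilt time.Time) (bool, *ListingEntry, error) {
	resp, err := http.Get(listingURL)
	if err != nil {
		return false, nil, err
	}
	defer resp.Body.Close()

	var listing Listing
	if err := json.NewDecoder(resp.Body).Decode(&listing); err != nil {
		return false, nil, err
	}

	// find the newest listed database for this schema version
	var newest *ListingEntry
	for i := range listing.Available[schema] {
		entry := &listing.Available[schema][i]
		if newest == nil || entry.Built.After(newest.Built) {
			newest = entry
		}
	}
	if newest == nil {
		return false, nil, fmt.Errorf("no databases listed for schema %s", schema)
	}
	return newest.Built.After(currentBuilt), newest, nil
}
```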

When Grype begins execution, it parses (or generates) the SBOM in parallel with the database update, but the scan cannot begin until both are complete. Therefore, if downloading the listing file and/or the database takes longer than parsing/generating the SBOM, every second by which the download outlasts the SBOM step is experienced by end users as added Grype execution time.
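
For illustration only (the function and parameter names below are hypothetical, not grype’s internals), the ordering constraint looks roughly like this: the scan waits on whichever of the two concurrent steps finishes last.

```go
package scanflow

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// run sketches the ordering described above: SBOM generation and the DB update
// run concurrently, and the scan starts only after both have finished, so any
// extra DB-download latency is paid directly by the user.
func run(
	ctx context.Context,
	generateSBOM func(context.Context) error, // hypothetical: parse or generate the SBOM
	updateDB func(context.Context) error, // hypothetical: check listing.json, download a new DB if needed
	scan func(context.Context) error, // hypothetical: the vulnerability scan itself
) error {
	g, gctx := errgroup.WithContext(ctx)
	g.Go(func() error { return generateSBOM(gctx) })
	g.Go(func() error { return updateDB(gctx) })
	if err := g.Wait(); err != nil { // blocked on the slower of the two
		return err
	}
	return scan(ctx)
}
```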

We have a number of customers complaining about slowness in Grype, and have isolated the slowness to slow response times from the CDN. These slow response times affect both downloading the new Grype database and checking whether a new Grype database is available. (issue, issue, issue)

We made a previous attempt to fix this issue (pr) by setting timeouts on Grype’s calls both to check whether an update is available (default 30s) and to download the new DB (default 120s). This changes the customer experience when the CDN is slow from a hang (the previously used client had no timeout) to a failure after 30 or 120 seconds, depending on which step is slow. However, this is an incomplete fix, because it leaves two bad customer experiences. First, customers on slow connections will find the default timeouts too low: a normal, progressing download is interrupted by the timeout and Grype fails to execute. Second, customers who hit the CDN fault still see Grype hang for some time before inevitably failing.

Gathering more data

Because users are the ones bringing this issue to our attention, we should fix our monitoring. Anchore team members will work to set up DataDog synthetic monitoring to get data on how often this happens, in what regions, etc, and see if this information gives us more options or more insight.

Options

Detecting the failure mode and retrying

The failure mode that Grype DB experiences is interesting: The CDN accepts the connection, sends response headers, and starts sending the response body, but then the rest of the response body never arrives or slows to a crawl. Users have reported 30s+ times to download ~150KB JSON documents, and have captured screenshots of progress bars effectively paused mid-way through the download.

It would be ideal to distinguish between this failure mode, which I’ll call a “stalled connection,” and a working but slow connection which is annoying but will eventually succeed. It would be helpful if Grype could immediately kill and retry a stalled connection, but let a connection that is still reading, however slowly, keep trying.

We can do this by wrapping the response body reader in a reader that repeatedly calls conn.SetReadDeadline on the underlying connection during its Read method, essentially saying, “if X seconds pass without any new bytes to read, fail the connection.” The hope is that an X can be found such that a slow connection always sends new bytes before X seconds elapse, while a stalled connection is killed and retried after only X seconds.
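
One way to do this in Go (a sketch under assumptions, not grype’s actual client; the idle threshold and dial timeout values are illustrative) is to wrap the connection returned by the transport’s dialer so that every Read pushes the read deadline forward. This is equivalent to wrapping the body reader, because the transport’s body reads ultimately call Read on this connection.

```go
package stalldetect

import (
	"context"
	"net"
	"net/http"
	"time"
)

// deadlineConn pushes the read deadline forward before every Read, so the
// connection fails only if no bytes arrive for `idle` (a stall), not merely
// because the overall transfer is slow.
type deadlineConn struct {
	net.Conn
	idle time.Duration
}

func (c *deadlineConn) Read(p []byte) (int, error) {
	if err := c.Conn.SetReadDeadline(time.Now().Add(c.idle)); err != nil {
		return 0, err
	}
	return c.Conn.Read(p)
}

// NewClient returns an *http.Client that kills stalled connections after
// `idle` of silence but lets slow-yet-progressing downloads keep going.
// Keep-alives are disabled so that pooled idle connections don't trip the
// deadline between requests.
func NewClient(idle time.Duration) *http.Client {
	dialer := &net.Dialer{Timeout: 30 * time.Second}
	return &http.Client{
		// Deliberately no overall Timeout: the per-read deadline replaces it.
		Transport: &http.Transport{
			DisableKeepAlives: true,
			DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
				conn, err := dialer.DialContext(ctx, network, addr)
				if err != nil {
					return nil, err
				}
				return &deadlineConn{Conn: conn, idle: idle}, nil
			},
		},
	}
}
```

A retry wrapper around this client could then retry only when the error is a timeout (for example, when errors.As finds a net.Error whose Timeout() method returns true), rather than on every failure.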

It’s also worth noting that retries increase the load on our CDN, and our CDN being overloaded is part of the cause of the issue. In general, retries increase steady-state reliability of a system, but amplify the effects of congestion and outages.

Just retrying (not recommended)

We could just retry, relying on Go’s regular HTTP client Timeout setting. However, this would result in slow connections, which would eventually have succeeded, being repeatedly aborted and retried, and in stalled connections waiting a long time before being retried. In other words, for any value of N, N seconds is both too soon to abort a slow connection and too late to retry a stalled one.
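
For contrast, the naive version looks roughly like this (a sketch with hypothetical names and values); the single whole-request Timeout cannot tell a slow download from a stalled one.

```go
package naiveretry

import (
	"io"
	"net/http"
	"time"
)

// fetchWithRetry aborts any request whose total duration exceeds `timeout`,
// whether the connection is stalled or merely slow, and then retries from scratch.
func fetchWithRetry(url string, attempts int, timeout time.Duration) ([]byte, error) {
	client := &http.Client{Timeout: timeout} // whole-request timeout, including reading the body
	var lastErr error
	for i := 0; i < attempts; i++ {
		resp, err := client.Get(url)
		if err != nil {
			lastErr = err
			continue
		}
		body, err := io.ReadAll(resp.Body)
		resp.Body.Close()
		if err == nil {
			return body, nil
		}
		lastErr = err // a slow-but-healthy download lands here too once `timeout` elapses
	}
	return nil, lastErr
}
```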

Add a ‘min DB age to check’ field

In this option, we change Grype source code so that, if the current database is less than N hours old, we don’t check for a new update at all. This will reduce load on the CDN. A new database is published roughly every 24 hours, so for example if we set N to 4 hours, we expect approximately ⅙ (4/24) of requests to be skipped.
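
A minimal sketch of the guard (the option name and function are hypothetical; the real change would live in grype's configuration and update code):

```go
package dbage

import "time"

// defaultMinAge is a hypothetical default for the proposed
// "min DB age to check" option; 4h is the example used above.
const defaultMinAge = 4 * time.Hour

// shouldCheckForUpdate reports whether grype should contact the CDN at all:
// if the local database was built less than minAge ago, both the listing.json
// download and any DB download are skipped entirely.
func shouldCheckForUpdate(dbBuilt time.Time, minAge time.Duration) bool {
	return time.Since(dbBuilt) >= minAge
}
```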

However, this does not even out the load; it only reduces it during the first N hours after a new database is published. For example, since there are a large number of Grype invocations around the clock, if we change the default configuration to wait 4 hours after downloading one Grype DB before checking for another, we expect a large reduction in traffic in the first 4 hours after we release a new database, and then a resumption of full traffic once the newest database is 4 hours old. We wouldn’t expect much reduction in requests per second (RPS) after that point.

Gradually roll out new Grype DBs

Perhaps some CDN configuration could be used to spread out the rollout of each new Grype DB. For example, every geographic region could get the new Grype DB at midnight local time. If this were combined with the “min DB age to check” option above, we could achieve, for example, a ⅙ reduction in total load by configuring a 4 hour min DB age to check, with that reduction spread evenly across the day rather than concentrated just after release.

Enable Mirrors

Right now, the listing file and all the databases are served from CloudFlare at the same hostname. This generates 10-12M requests / day on CloudFlare. At least one user (issue) has requested the ability to mirror this data. Whether this would help depends a great deal on whether people are willing to host mirrors. It’s worth noting that this would add a different mechanism for offline/air-gapped Grype, although that is already supported.

Explore caching in the scan-action

We should research whether scan-action, which uses Grype internally, could use the GitHub Actions cache in order to hit the CDN less often.

Decision

We’ll see! Still gathering info right now.

One tricky thing might be that a lot of these invocations are running in cloud instances (I am guessing) and will almost certainly re-download the database, since each ‘action’, ‘pipeline’, or whatever doesn’t implement caching.

So it might make a difference to some executions of grype, but might also make no difference at all, if they’re all clueless independent bots with no cache and no memory of the grype database.

Hard to know without data?

We’ve merged a change meant to address this issue: chore: shrink listing file to speed download times by willmurphyscode · Pull Request #347 · anchore/grype-db · GitHub

The upshot of this change is that the listing.json file is ~2% of its previous size, but grype db list will only show roughly the 3 latest databases per schema version, instead of roughly 120. The old databases are still available, they’re just no longer indexed in the default listing file. If you need to download old grype databases, for research for example, please let us know by posting here, sending us a message here, or opening an issue on GitHub.
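
Conceptually, the change trims the listing at generation time to the newest few entries per schema version (a simplified sketch, not the actual grype-db code):

```go
package listingtrim

import (
	"sort"
	"time"
)

// Entry is a simplified stand-in for one database entry in listing.json.
type Entry struct {
	Built time.Time
	URL   string
}

// trim keeps only the `keep` newest entries per schema version; this is what
// shrinks the listing from ~120 entries per schema to ~3.
func trim(available map[string][]Entry, keep int) map[string][]Entry {
	out := make(map[string][]Entry, len(available))
	for schema, entries := range available {
		sorted := append([]Entry(nil), entries...) // copy so the input isn't reordered
		sort.Slice(sorted, func(i, j int) bool { return sorted[i].Built.After(sorted[j].Built) }) // newest first
		if len(sorted) > keep {
			sorted = sorted[:keep]
		}
		out[schema] = sorted
	}
	return out
}
```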

We’ve also set up some DataDog synthetic metrics, so that we can measure whether this change helps. We’ll let this change bake for about a week before deciding whether it helped and was sufficient.

Separately, the DataDog synthetics are telling us that the default timeout for downloading the new database, 2 minutes, is definitely too low for some regions and networks, so I’ll raise that, probably to 5 minutes.

This change seems to be working!

We have synthetic clients hitting the grype listing file every 10 minutes. There were 12 failures to download the listing file during the 3 days before we rolled out the fix, but only one in the ~4 days since the fix. (There’s some weekly cyclicality to the load, so comparing the same Wed-Thu window with and without the fix: uptime on the listing file went from 99.01% for July 31 - Aug 1 to 100% for Aug 7 - Aug 8.)

I think this is enough of an improvement to regard this issue as resolved.

Thanks to everyone who commented and contributed! And remember, the databases we de-listed to save bandwidth are not gone, we’re just not sending their info back to grype every single update check. If you need them for some reason, please get in touch.

For anyone interested, there is more information about an update we made here: Grype Vulnerability Hosting Update
