How long should syft really take?

Hi, I’m working on some automation to create SBOMs for my container images. Some of the images are quite large (300 MB+), and generating an SBOM is taking forever, literally hours.

I’m running this in AWS on an 8-core, 8 GB RAM machine. I’ve increased parallelism to 5; the CPU is running between 69% and 99%+, and RAM is hovering right around 62% utilized.

I’ve installed Syft on Ubuntu 24.04.1 using the standard pipe-to-bash install script.
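(For reference, that’s the one-liner from the Syft README, something along the lines of: curl -sSfL https://raw.githubusercontent.com/anchore/syft/main/install.sh | sh -s -- -b /usr/local/bin)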

The general command that I’m running is: syft {image_uri} -o spdx-json > {sbom_file}

I’ll admit I’m entirely new to using Syft and creating SBOMs in general, but is it normal for a 300 MB image to take hours to generate the details? Is there extra config I should consider? Do I need an even larger AWS instance, a different version of Syft, a ceremonial dance perhaps?

Thanks in advance -BM

Hello @billy_muller_cyera!

It’s definitely not normal for scans to take hours. For example, I have a 20 GB test image that, excluding download time, takes less than 5 minutes to scan. But there are a lot of factors that contribute to scan times: size of the image (of course), number of files, types of content, etc. Are there any publicly available images exhibiting this behavior? If not, we’ll need a little more information to understand what’s potentially going on. One thing to look at: if you run with -v, you will see a list of cataloger times printed to the log (stderr). That might be a good place to start, to see if one particular cataloger is taking a long time.
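For example, something like this keeps the SBOM on stdout and captures the verbose log to a file (the log file name here is just an example):

syft {image_uri} -o spdx-json -v > {sbom_file} 2> syft-verbose.log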

I didn’t see which version of Syft you’re using, but Syft 1.20 included a fix for an issue that was causing very long scan times like this for some users: a library we use was downloading certificates and CRLs when scanning certain types of DLLs/EXEs.
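(You can double-check the installed version with: syft version)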

P.S. There’s a parallelism option (SYFT_PARALLELISM=<number>) which allows multiple catalogers to run in parallel. We’re actively working on more performance improvements to more fully utilize the available resources; a lot of things currently run serially, hence the somewhat low resource utilization you noted. But you probably shouldn’t be seeing scans take hours!
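For example, using the same command shape as yours (the value is just an example):

SYFT_PARALLELISM=4 syft {image_uri} -o spdx-json > {sbom_file}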

Hi, I actually tested on a personal image and it went super fast, so it must just be a few specific images. I’ve enabled parallelism up to 5, which I thought would help, but it hasn’t made the process any faster on the images that are taking hours.

I have enabled -vv, which does give me a few extra details. I am running 1.20. Additionally, it seems to be getting stuck at these catalogers:

[0358] DEBUG found path duplicate of /usr/lib/node_modules/npm/bin/npm-cli.js
[0358] DEBUG found path duplicate of /usr/lib/node_modules/npm/bin/npx-cli.js
[0358] INFO task completed elapsed=440.685707ms task=apk-db-cataloger
[0358] DEBUG discovered 0 packages cataloger=java-jvm-cataloger
[0358] INFO task completed elapsed=405.542µs task=java-jvm-cataloger
[0360] DEBUG discovered 0 packages cataloger=linux-kernel-cataloger
[0360] INFO task completed elapsed=1.125798454s task=linux-kernel-cataloger
[0360] DEBUG discovered 0 packages cataloger=bitnami-cataloger
[0360] INFO task completed elapsed=330.669346ms task=bitnami-cataloger
[0360] DEBUG discovered 0 packages cataloger=wordpress-plugins-cataloger
[0360] INFO task completed elapsed=1.559628ms task=wordpress-plugins-cataloger

The catalogers you listed are all < 2 seconds. Are you saying it doesn’t seem to proceed after this?

Yes, basically. It just stops there and doesn’t continue any further.

It’s a little difficult to tell which catalogers are problematic from only a partial list of completed ones. If you disable parallelism (by not setting the environment variable, or by setting SYFT_PARALLELISM=1), it should be easier to identify at least the first cataloger that seems to hang. When running in parallel, the ordering isn’t really known, so it’s hard to tell whether there are other entries before these that also completed.
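For example (the log file name here is just an example):

SYFT_PARALLELISM=1 syft {image_uri} -o spdx-json -vv > {sbom_file} 2> syft-serial.log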

I"ll do that next, for now I’m pulling down my docker images, running them and exporting the live container file system to a tar and then running syft against that.

Those scans work incredibly well, so I’ll have to keep digging.