Why is indexing an image so much faster than indexing a file system?

Hi!
Why does indexing a file system take so much longer than indexing a Docker image?
Do you have any suggestions for reducing the indexing time (on EC2 machines, for example)?

I suspect this is simply because we have to fully scan and index the entire filesystem, which could contain executables anywhere, whereas containers tend to already have an index of the files held within.

There was a recent discussion of this in Syft issue 3145, and on the most recent team live stream, starting at this timestamp.

What @popey said is pretty much correct: when scanning, Syft needs an index to find files based on glob matches like **/python and on other properties, such as executable status. Syft builds this index before scanning, but when you scan a filesystem the index doesn’t exist yet, so the whole filesystem has to be traversed to build it. Container images, on the other hand, are tar files, which effectively have an index already built that Syft uses. That is one reason. Another is that images quite often have significantly fewer files than virtual machines: I suspect that if you count the number of files on an EC2 machine vs. a big official Docker image, you’ll find a big disparity.
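
To make the difference concrete, here is a minimal Python sketch of the two ways of getting a file list. It is not how Syft is implemented; the function names (`index_directory`, `index_tar`, `find_by_basename`) are made up for illustration. It only shows that a directory tree has to be walked entry by entry, while a tar archive can hand back its member list directly, and it gives you a quick way to compare file counts.

```python
import os
import tarfile


def index_directory(root):
    """Walk the whole tree under root and collect every non-directory entry.

    Paths are returned relative to root, using "/" as the separator.
    """
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            paths.append(os.path.relpath(full, root).replace(os.sep, "/"))
    return sorted(paths)


def index_tar(tar_path):
    """Read the member list of a tar archive and keep the regular files."""
    with tarfile.open(tar_path) as tar:
        return sorted(m.name for m in tar.getmembers() if m.isfile())


def find_by_basename(index, basename):
    """Rough stand-in for a glob like **/python: match on the last path segment."""
    return [p for p in index if p.rsplit("/", 1)[-1] == basename]
```

Calling `len(index_directory("/"))` on a VM and `len(index_tar("layer.tar"))` on an exported image layer (both paths are placeholders) should give a rough idea of the disparity in file counts.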

For what it’s worth, I’ve done an experiment with an alternative parallel indexer that is very promising, but implementing it hasn’t become a priority yet.
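
For readers wondering what a parallel walk could look like in general, here is a rough Python sketch. It is unrelated to the experiment mentioned above and to Syft's code; `walk_subtree`, `parallel_index`, and the `workers` parameter are hypothetical names. The idea is simply to hand each top-level directory to its own worker thread and merge the results. Whether this is actually faster depends on the storage and the workload.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def walk_subtree(path):
    """Serially walk one subtree and return the full paths of its non-directory entries."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(path):
        found.extend(os.path.join(dirpath, name) for name in filenames)
    return found


def parallel_index(root, workers=8):
    """Index root by walking each top-level directory in a separate worker thread."""
    paths, subdirs = [], []
    with os.scandir(root) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                subdirs.append(entry.path)
            else:
                # Files (and symlinks) directly under root are recorded as-is.
                paths.append(entry.path)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for found in pool.map(walk_subtree, subdirs):
            paths.extend(found)
    return sorted(paths)
```

Splitting only at the top level is the simplest possible partitioning; a single huge directory such as /usr would still be walked by one thread, so a real indexer would need finer-grained work distribution.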

Sounds great!
One of the challenges of scanning an EC2 instance today is how long it takes, so a new resolver like this would be a really helpful feature.