Why is indexing an image much faster than indexing a file system?

TimBrown1611 · August 25, 2024, 6:06am

Hi!
Why does indexing a file system take much longer than indexing a Docker image?
Do you have any suggestions for reducing the indexing time (on EC2 machines, for example)?

popey · August 26, 2024, 10:09am

I suspect this is simply because we have to fully scan and index the entire filesystem, which could contain executables anywhere, whereas containers tend to already have an index of the files held within.

There was a recent discussion on this on syft issue 3145, and on the most recent team live stream starting at this timestamp.

kzantow · August 26, 2024, 2:58pm

What @popey said is pretty much correct: when scanning, Syft needs an index to find files based on glob matches like **/python and on other properties, like executable status. Syft builds this index before scanning, but when you scan a filesystem the index doesn’t exist yet and the whole filesystem needs to be traversed in order to build it. Container images, on the other hand, are tar files, which effectively have an index already built that Syft uses. This is one reason, but another is that images quite often have significantly fewer files than virtual machines: I suspect that if you count the number of files on an EC2 machine vs. a big official Docker image, you’ll find a big disparity there.
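To make the difference concrete, here is a minimal Python sketch of the two ways of building such an index. It is not Syft's implementation (Syft is written in Go); the function names index_directory, index_tar, and find_by_basename are made up for illustration, and the basename match is only a rough stand-in for a glob like **/python.

```python
import os
import tarfile


def index_directory(root):
    """Walk every directory under root and record each file's relative path
    and whether it is executable. The cost grows with the number of entries on disk."""
    index = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            index[rel] = os.access(full, os.X_OK)
    return index


def index_tar(tar_path):
    """Build the same kind of index from the member headers of a tar archive.
    No directory traversal is needed: the archive already lists its contents."""
    index = {}
    with tarfile.open(tar_path) as tar:
        for member in tar.getmembers():
            if member.isfile():
                # Executable if any of the user/group/other execute bits is set.
                index[member.name] = bool(member.mode & 0o111)
    return index


def find_by_basename(index, basename):
    """Hypothetical helper: return indexed paths whose final component equals basename."""
    return sorted(p for p in index if p.rsplit("/", 1)[-1] == basename)
```

Both functions return the same shape of index (path mapped to executable status), so the lookup step is identical afterwards. The difference is only in how much work it takes to get there: one stat-like call per file on disk versus reading headers out of a single archive.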

For what it’s worth, I’ve done an experiment with an alternative parallel indexer that is very promising, but implementing it hasn’t become a priority yet.
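As a rough illustration of what parallel indexing can look like in general, the Python sketch below lists the directories of each tree level concurrently with a thread pool. This is not the experimental indexer mentioned above, whose design is not described in this thread; scan_one, parallel_walk, and the workers default are assumptions made for the example. Whether it is actually faster depends on the storage and the operating system.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def scan_one(path):
    """List a single directory and return (subdirectories, files)."""
    subdirs, files = [], []
    try:
        with os.scandir(path) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    subdirs.append(entry.path)
                else:
                    files.append(entry.path)
    except OSError:
        pass  # unreadable or vanished directory: skip it
    return subdirs, files


def parallel_walk(root, workers=8):
    """Breadth-first traversal that lists all directories of one level concurrently."""
    all_files = []
    level = [root]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while level:
            next_level = []
            # pool.map keeps the input order and yields one result per directory.
            for subdirs, files in pool.map(scan_one, level):
                next_level.extend(subdirs)
                all_files.extend(files)
            level = next_level
    return all_files
```

The result is the same set of files a sequential walk would find (symlinked directories are recorded as entries rather than followed); only the order in which directories are visited changes.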

TimBrown1611 · August 26, 2024, 4:14pm

Sounds great!
One of the challenges today in scanning an EC2 machine is the amount of time it takes, so this kind of new resolver would be a really helpful feature.

TimBrown1611 · September 26, 2024, 8:53am

Another question: besides the number of files, do you think other parameters can affect the indexing time? I did some tests on a Windows AMI and I see it takes much longer than on Linux. It may be related to the number of files, but I want to understand whether we can think of a way to reduce this runtime (it can take 40 minutes to index a Windows machine).
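One way to check how much of that time is plain traversal is to count files and time a walk of each top-level directory, independently of Syft. The Python sketch below does that; profile_top_level is a made-up helper for illustration, and its timings only measure directory walking, not anything Syft does on top of it.

```python
import os
import time


def profile_top_level(root):
    """Return (directory, file_count, seconds) for each immediate subdirectory
    of root, slowest first."""
    with os.scandir(root) as entries:
        top_dirs = [e.path for e in entries if e.is_dir(follow_symlinks=False)]
    results = []
    for top in sorted(top_dirs):
        start = time.perf_counter()
        # os.walk silently skips directories it cannot read.
        count = sum(len(files) for _dir, _subdirs, files in os.walk(top))
        results.append((top, count, time.perf_counter() - start))
    return sorted(results, key=lambda r: r[2], reverse=True)
```

Running it against the machine's root (for example "/" on Linux or "C:\\" on Windows) and comparing the totals between the Windows and Linux AMIs would show whether the gap is mostly explained by file count or whether per-file cost also differs.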
