Why is syft reporting hundreds of random files?

I get 1K+ file entries that seem uninteresting, some examples:

  • /usr/share/zoneinfo/zone.tab - random linux configuration files
  • /usr/share/doc/libssl3/copyright - random text files
  • /usr/lib/x86_64-linux-gnu/gconv/IBM423.so - this should be the gconv package

There was a related thread in Feb, but it has no resolution: "Why syft version 1.20 is now listing files in the SBOM as default?"

I tried --select-catalogers "-file", but then syft tells me:

[0002]  WARN no file catalogers selected but file selection is configured as "owned-by-package" (this may be unintentional)

I don’t understand what it means.

How can I exclude such files from being reported?

Hi @Jakub_Bochenski, based on community feedback in this area, I think we’ve determined that the options here need some improvements. There are a number of file-related options, but ultimately, even if you use --select-catalogers -file, Syft includes the files it used for finding package evidence.

The fact is, Syft reads those files regardless of the file-related options – it must read these to figure out the packages. And since it already did that work, we might as well include those files in the resulting SBOM. Some of them are in the locations section of a package, indicating they were scanned to find evidence of the package. I’m not sure we want to remove this file information, but this is separate from the files section that includes hashes and such.

The WARN message you see means you have the default configuration option of the file selection section:

file:
  metadata:
    # select which files should be captured by the file-metadata cataloger and included in the SBOM.
    # Options include:
    #  - "all": capture all files from the search space
    #  - "owned-by-package": capture only files owned by packages
    #  - "none", "": do not capture any files (env: SYFT_FILE_METADATA_SELECTION)
    selection: 'owned-by-package'

… this means that files used to identify a package will be included and hashed, if configured.
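As a rough sketch only (not something verified in this thread), the selection could be overridden per run through the SYFT_FILE_METADATA_SELECTION environment variable named in the config comment above, combined with the --select-catalogers option you already tried. The helper names build_syft_invocation and run_syft are hypothetical, and whether this combination silences the WARN is an assumption to check against your syft version.

```python
import os
import subprocess


def build_syft_invocation(source, file_selection="none"):
    """Return (command, env) for a syft run with the file selection overridden.

    Assumption: syft picks up SYFT_FILE_METADATA_SELECTION from the
    environment, as the config comment suggests.
    """
    env = dict(os.environ)
    env["SYFT_FILE_METADATA_SELECTION"] = file_selection
    command = [
        "syft", "scan", source,
        "-o", "cyclonedx-json",
        # Same selection as in the question: disable the file catalogers.
        "--select-catalogers=-file",
    ]
    return command, env


def run_syft(source):
    """Run syft (must be on PATH) and return the SBOM text from stdout."""
    command, env = build_syft_invocation(source)
    result = subprocess.run(
        command, env=env, capture_output=True, text=True, check=True
    )
    return result.stdout
```

Calling run_syft("alpine:latest") would then return the CycloneDX JSON as a string, provided syft is installed.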

This is tangential to the --select-catalogers -file option, which disables the file catalogers that separately hash files or read metadata, etc. (see syft/internal/task/file_tasks.go on main in anchore/syft on GitHub), but files may still be added to the SBOM because they were read during the cataloging process.

I think the end result is certainly a little confusing. We should try to nail down the behavior that’s expected in all the cases.

Hello, and again thanks for a good explanation.

Still I’m a bit confused, or maybe my question wasn’t clear.

What I’m after are the .components[] entries with type: file (in CycloneDX format).
When I specify --select-catalogers -file they are gone, but are you suggesting they would stay anyway?

I wrote that those are not interesting, because my aim is to feed the generated SBOM to vulnerability and license analysis.
There is nothing useful I can do with components that only have a content hash and no package identification.
I guess there might be other use cases, but I think this one is common.
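For illustration, this is roughly the post-processing I would otherwise have to do on the generated SBOM. It is a minimal sketch that assumes the CycloneDX JSON output has a top-level components array whose entries carry a type field; drop_file_components is just a name I made up.

```python
import json


def drop_file_components(bom):
    """Return a shallow copy of a CycloneDX BOM dict without type "file" components."""
    kept = [c for c in bom.get("components", []) if c.get("type") != "file"]
    filtered = dict(bom)
    filtered["components"] = kept
    return filtered


def filter_sbom_text(sbom_text):
    """Parse CycloneDX JSON text, drop file components, and serialize it again."""
    return json.dumps(drop_file_components(json.loads(sbom_text)), indent=2)
```

Note that this only touches .components[]; any other section that refers to the removed entries (for example dependency references) would be left as-is.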