Today, Syft has a number of catalogers that overlap in the packages they surface and the files they read. For example, a user installs curl with a package manager and the RPM cataloger finds that package, but the binary cataloger also finds the curl binary and creates a second package of a different type. When this happens, Syft by default creates an ownership-by-file-overlap relationship between the two packages, and a post-cataloging step then removes the secondary package based on that relationship. As a result, users typically see only the more accurate package-manager version, not the binary one.
As far as I can tell, this deduplication is useful in order to:
- provide the best SBOM to the end user, and by association:
- avoid vulnerability false positives, thanks to the more accurate information in package manager vulnerability feeds and the versions they use
Beyond these purposes, I’m having a hard time finding the value of the relationships. Maybe Syft has an option to disable the deduplication if a user really wants that, but why not just have a specific function to do so without adding relationships only to later remove them?
We really want Grype to match only on the package-manager versions, and there’s already a custom function to filter out the overlapping packages. That function only works because Grype knows which feeds provide better results, and making part of that determination in Syft seems like the wrong place to do it.
So, this is all a pretty long-winded way of asking: should we get rid of the ownership-by-file-overlap relationship altogether and simply have deduplication functionality that operates without these relationships?
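
To make the question concrete, here is a rough sketch of what relationship-free deduplication could look like. It is not Syft's actual code: the `Package` type, the `authoritative` set, and `dedupeByFileOverlap` are all hypothetical names made up for illustration, and the rule used (drop a non-package-manager package when every one of its files is owned by a package-manager package) is an assumption about the intended behavior.

```go
package main

import "fmt"

// Package is a hypothetical, simplified stand-in for a cataloged package.
// It is NOT Syft's pkg.Package type.
type Package struct {
	Name  string
	Type  string   // e.g. "rpm", "deb", "binary"
	Files []string // paths the package owns or was discovered at
}

// authoritative lists package types treated as the preferred source.
// Which types belong here is an assumption made for this sketch.
var authoritative = map[string]bool{"rpm": true, "deb": true, "apk": true}

// dedupeByFileOverlap removes every non-authoritative package whose files
// are all owned by some authoritative package. It works directly on the
// package list and never creates or consults a relationship.
func dedupeByFileOverlap(pkgs []Package) []Package {
	// Collect every path owned by an authoritative package.
	owned := make(map[string]struct{})
	for _, p := range pkgs {
		if authoritative[p.Type] {
			for _, f := range p.Files {
				owned[f] = struct{}{}
			}
		}
	}

	kept := make([]Package, 0, len(pkgs))
	for _, p := range pkgs {
		if !authoritative[p.Type] && len(p.Files) > 0 && allOwned(p.Files, owned) {
			continue // duplicate of a package-manager package; drop it
		}
		kept = append(kept, p)
	}
	return kept
}

// allOwned reports whether every path in files is present in owned.
func allOwned(files []string, owned map[string]struct{}) bool {
	for _, f := range files {
		if _, ok := owned[f]; !ok {
			return false
		}
	}
	return true
}

func main() {
	pkgs := []Package{
		{Name: "curl", Type: "rpm", Files: []string{"/usr/bin/curl", "/usr/share/doc/curl/README"}},
		{Name: "curl", Type: "binary", Files: []string{"/usr/bin/curl"}},
		{Name: "jq", Type: "binary", Files: []string{"/usr/local/bin/jq"}},
	}
	for _, p := range dedupeByFileOverlap(pkgs) {
		fmt.Printf("%s (%s)\n", p.Name, p.Type)
	}
}
```

In this example the binary curl package is dropped because its only file is owned by the RPM package, while the binary jq package is kept because no package manager claims its path. Note that the `authoritative` set still encodes a judgment about which package types are preferred, so a sketch like this does not by itself settle whether that decision belongs in Syft or in Grype.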