Exclude-binary-overlap-by-ownership flag is not working

I am scanning a flattened OS image in which a Python package is detected twice, once by “rpm-db-cataloger” and once by “python-installed-package-cataloger”.

According to the docs, “exclude-binary-overlap-by-ownership” is true by default. I take that to mean the SBOM should contain only the entry from the RPM cataloger and not the one from the Python package cataloger. Am I understanding this correctly, or is something wrong with the artifact I am scanning?

Below are the two identifications:

Python cataloger:

{
  "id": "c4995e2ecbab59ab",
  "name": "setuptools",
  "version": "39.2.0",
  "type": "python",
  "foundBy": "python-installed-package-cataloger",
  "locations": [
    {
      "path": "/usr/share/python3-wheels/setuptools-39.2.0-py2.py3-none-any.whl_extracted/setuptools-39.2.0.dist-info/METADATA"
    }
  ],
  "purl": "pkg:pypi/setuptools@39.2.0"
}

RPM cataloger:

{
  "id": "e93dce2edcfb4641",
  "name": "python3-setuptools",
  "version": "39.2.0-8.el8_10",
  "type": "rpm",
  "foundBy": "rpm-db-cataloger",
  "purl": "pkg:rpm/python3-setuptools@39.2.0-8.el8_10?arch=noarch&upstream=python-setuptools-39.2.0-8.el8_10.src.rpm",
  "metadata": {
    "name": "python3-setuptools",
    "version": "39.2.0",
    "epoch": null,
    "architecture": "noarch",
    "release": "8.el8_10"
  },
  "locations": [
    {
      "path": "/var/lib/rpm/Packages"
    }
  ]
}

  1. How can I exclude the overlapping packages that are detected by both binary and package catalogers?
  2. Also, release information like “8.el8_10” is available only from the RPM cataloger, not from the Python cataloger. Is it possible to add this field for package catalogers so that correlator tools can use it to identify backported fixes?

Hi @santhosh, exclude-binary-overlap-by-ownership only excludes binary packages. For example, if I’m on Fedora and do yum install curl, Syft will find an RPM called curl that owns the file at /usr/bin/curl, and will find a binary package called curl at /usr/bin/curl. By default, Syft will deduplicate the binary package in favor of the RPM package, because the RPM has better metadata.

However, this setting doesn’t apply to other catalogers, such as python. The configuration focuses on deduplicating binary packages in favor of OS packages because binary packages have so little metadata.
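To make the rule concrete, here is a minimal Python sketch of the idea. It is not Syft's implementation: the Package class, its field names, and the drop_overlapping_binaries helper are hypothetical and only model "drop a binary package when an OS package owns the file it was found at".

```python
from dataclasses import dataclass, field


@dataclass
class Package:
    # Hypothetical, simplified model of a cataloged package (not Syft's data model).
    name: str
    type: str                                      # e.g. "rpm", "binary", "python"
    evidence: set = field(default_factory=set)     # paths the package was found at
    owned_files: set = field(default_factory=set)  # paths an OS package claims


def drop_overlapping_binaries(packages):
    """Drop binary packages whose evidence files are all owned by an RPM."""
    os_owned = set()
    for p in packages:
        if p.type == "rpm":
            os_owned |= p.owned_files
    kept = []
    for p in packages:
        if p.type == "binary" and p.evidence and p.evidence <= os_owned:
            continue  # the RPM explains this file, so the binary entry is redundant
        kept.append(p)
    return kept


# Illustrative paths only; the dist-info path below is made up for the example.
metadata_path = "/usr/lib/python3/site-packages/setuptools.dist-info/METADATA"
packages = [
    Package("curl", "rpm", owned_files={"/usr/bin/curl"}),
    Package("curl", "binary", evidence={"/usr/bin/curl"}),
    # Python packages are not type "binary", so this rule leaves them alone.
    Package("setuptools", "python", evidence={metadata_path}),
    Package("python3-setuptools", "rpm", owned_files={metadata_path}),
]

kept = drop_overlapping_binaries(packages)
print([(p.name, p.type) for p in kept])
```

In this model the binary curl entry disappears, while the Python setuptools entry survives even though an RPM owns its METADATA file, which matches the behavior reported in the question.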

Different packaging ecosystems use different version formats. It’s not possible to add the RPM version info to a Python package because downstream tools will expect the Python package to have a Python-style version, not an RPM style version.
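As a small illustration of the difference, using the two version strings from the question: the sketch below splits an RPM-style version into epoch, version, and release. The split_rpm_version helper is hypothetical and written for this example; it assumes the usual [epoch:]version-release layout and is not taken from Syft.

```python
def split_rpm_version(full_version):
    """Split an RPM-style '[epoch:]version-release' string into its parts."""
    epoch = None
    rest = full_version
    if ":" in rest:
        epoch, rest = rest.split(":", 1)
    # The release is everything after the last hyphen.
    version, _, release = rest.rpartition("-")
    if not version:
        # No hyphen at all: treat the whole string as the version.
        version, release = release, None
    return epoch, version, release


rpm_epoch, rpm_version, rpm_release = split_rpm_version("39.2.0-8.el8_10")
pypi_version = "39.2.0"

# The upstream version matches; the release only exists on the RPM side.
print(rpm_version == pypi_version, rpm_release)
```

A correlator that needs the release can read it from the RPM entry (it is already in the RPM metadata shown in the question) rather than from the Python entry.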

It’s worth pointing out that, in Grype, we do deduplicate Python packages in favor of RPM packages for certain distros. We do this for those distros that publish vulnerabilities as they’re disclosed, rather than as they’re fixed, because otherwise the deduplication would result in false negatives.

Does that make sense? Right now Syft doesn’t have a setting to deduplicate installed python packages in favor of RPMs that contain python packages.

Should we add a way to exclude language ecosystem packages by file ownership overlap with OS packages? For example, if someone does yum install python3-urllib or yum install rubygem-nokogiri, it might be helpful for them to report the RPM and not the PyPI / RubyGem package, and it is arguably more correct to do it this way.

I’ve added the discuss tag so we can discuss this decision on a future livestream.

I’m on Fedora and do yum install curl, Syft will find an RPM called curl that owns the file at /usr/bin/curl, and will find a binary package called curl at /usr/bin/curl. By default, Syft will deduplicate the binary package in favor of the RPM package, because the RPM has better metadata.

I am confused. You are saying that Syft chooses the RPM over the binary package. I think my question is the same. In my case, the Python package is identified by what I understand to be a binary cataloger, “python-installed-package-cataloger”, and also by the RPM cataloger. Ideally, Syft should report only the RPM, since it contains more metadata, and ignore the entry from the Python package cataloger, right?

  1. My question is also that aggregation/de-duplication did not work. With the “exclude-binary-overlap-by-ownership” flag enabled, I should see only one occurrence of setuptools, right?
  2. Also, the RPM’s name has the prefix “python3-”. Is Syft reading that from the RPM package, or is Syft adding the prefix?
  3. Also, I did not see any parent-child relationship in the SBOM for the two identifications mentioned above.

Hi @santhosh! Thanks for all the great questions. I will do my best to answer them.

Python packages are not considered binary packages. Binary packages have "type": "binary" and are found by the binary-classifier-cataloger. Python packages do have some metadata, and so right now they are not excluded. That’s why I used the example of curl above: if Syft finds a single executable at /usr/bin/curl that matches some criteria, it might make a binary package, but by default, if there’s an RPM that explains the existence of that file, we’ll report the RPM instead.

No, only binary packages are removed. The duplicate you’re seeing is a python package, not a binary.

No, this is part of the RPM’s name. Syft finds this by querying the metadata about installed packages that RPM-based systems have in their file system, such as /var/lib/rpm/Packages.db for example (this path varies a bit).

Relationships are expressed in the .artifactRelationships array in the Syft SBOM. This is listed by ID. For example, in my Syft JSON for the example image I’m using here, I have a relationship like this:

{
  "parent": "21adfa34dc5eb886",
  "child": "c84d989e6b04a2ca",
  "type": "contains"
}

This tells me that the artifact with ID 21adfa34dc5eb886 contains the artifact with ID c84d989e6b04a2ca. In this case, the parent is the python3-setuptools RPM package, and the child is a file at /usr/lib/python3.12/site-packages/pkg_resources/_vendor/platformdirs/api.py, just as an example.
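If it helps, here is a minimal Python sketch that turns relationship entries into readable lines by resolving each ID to a package name or file path. The inline document is a trimmed stand-in that reuses the IDs and path from my example; a real Syft JSON has many more fields, and the build_labels helper is just something written for this post.

```python
import json

# Trimmed stand-in for a Syft JSON document: only the fields this sketch reads.
sbom = json.loads("""
{
  "artifacts": [
    {"id": "21adfa34dc5eb886", "name": "python3-setuptools", "type": "rpm"}
  ],
  "files": [
    {"id": "c84d989e6b04a2ca",
     "location": {"path": "/usr/lib/python3.12/site-packages/pkg_resources/_vendor/platformdirs/api.py"}}
  ],
  "artifactRelationships": [
    {"parent": "21adfa34dc5eb886", "child": "c84d989e6b04a2ca", "type": "contains"}
  ]
}
""")


def build_labels(sbom):
    """Map every artifact ID to its name and every file ID to its path."""
    labels = {a["id"]: a["name"] for a in sbom.get("artifacts", [])}
    labels.update({f["id"]: f["location"]["path"] for f in sbom.get("files", [])})
    return labels


labels = build_labels(sbom)
lines = [
    # Fall back to the raw ID when it is not a known artifact or file.
    f'{labels.get(r["parent"], r["parent"])} --{r["type"]}--> {labels.get(r["child"], r["child"])}'
    for r in sbom["artifactRelationships"]
]
for line in lines:
    print(line)
```

Running the same loop over a full SBOM makes it easier to see which relationships point at files and which point at other packages.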

I think your expectation makes some sense: we already remove binary packages in favor of RPMs, so maybe it should be possible to remove other package types in favor of RPMs. The reason we don’t is that vulnerability matchers might be expecting a PyPI package setuptools, not the RPM python3-setuptools, and so there might be false negatives (that is, missed vulnerabilities) if we drop the package. Specifically, we want to compare PyPI packages to GitHub security advisories in Grype. But it still makes sense to build this feature. We will discuss it on an upcoming community livestream.

I hope my answers made sense! Please let me know if you have any more questions.

Thanks for your time and for explaining in detail.

{
  "parent": "e93dce2edcfb4641",
  "child": "543eadb62d092391",
  "type": "evident-by"
},
{
  "parent": "e93dce2edcfb4641",
  "child": "bdb28d0f1eff249a",
  "type": "dependency-of"
},
{
  "parent": "e93dce2edcfb4641",
  "child": "d6af7aabcf235096",
  "type": "dependency-of"
}

Going by your explanation, I don’t see a parent-child relationship between the Python and RPM packages. In the relationships above, no child matches the ID of the Python package mentioned earlier.

  1. Is there a reason why no relationship is built in this case?
    Because the relationship is missing, it is hard to aggregate these packages.
  2. Also, do you know how the relationship between RPM and Python packages is calculated? Is it determined by name/version or by some other criterion?
  3. I see that Type: artifact.DependencyOfRelationship is used by many catalogers (Java, Golang, dotnet, etc.) but not by the Python cataloger in the code. Is that a conscious decision?

The reason we don’t is that vulnerability matchers might be expecting a PyPI package setuptools, not the RPM python3-setuptools, and so there might be false negatives (that is, missed vulnerabilities) if we drop the package. Specifically, we want to compare PyPI packages to GitHub security advisories in Grype.

I think that is partially correct. But the RPM can also be used to identify vulnerabilities from ELSA, right? I don’t know whether every GHSA has a corresponding ELSA entry and vice versa. If every GHSA/CVE always has a related ELSA finding, then the RPM should be sufficient and the Python packages can be excluded.

I didn’t explain it very clearly. The packages will be related by containing an overlapping set of files, not by containing one another. For example, in this case, the files that are evidence of the PyPI package “setuptools” are owned by the RPM package “python3-setuptools”.

Here’s a Python script that demonstrates detecting the overlap:

import json

# Load the SBOM produced by Syft in its JSON output format.
with open("syft.json") as fh:
    syft = json.load(fh)

# Look up the RPM package and the PyPI package by name.
rpm_artifacts = [a for a in syft["artifacts"] if a["name"] == "python3-setuptools"]
rpm = rpm_artifacts[0]
rpm_id = rpm["id"]

pypis = [a for a in syft["artifacts"] if a["name"] == "setuptools"]
pypi = pypis[0]
pypi_id = pypi["id"]

# IDs of everything the RPM is the parent of.
ids_owned_by_rpm = {
    r["child"] for r in syft["artifactRelationships"] if r["parent"] == rpm_id
}

# Keep only the children that are files, and resolve them to paths.
files_owned_by_rpm = {
    f["location"]["path"] for f in syft["files"] if f["id"] in ids_owned_by_rpm
}

# IDs of the files that are evidence of the PyPI package.
evidence_of_pypi = {
    r["child"]
    for r in syft["artifactRelationships"]
    if r["parent"] == pypi_id and r["type"] == "evident-by"
}

pypi_evidence = {
    f["location"]["path"] for f in syft["files"] if f["id"] in evidence_of_pypi
}

# Files that the RPM owns and that are also evidence of the PyPI package.
for overlap in files_owned_by_rpm & pypi_evidence:
    print(overlap)

And when I run it on the SBOM I made from my test image:

❯ python file-overlap.py
/usr/lib/python3.12/site-packages/setuptools-69.0.3.dist-info/top_level.txt
/usr/lib/python3.12/site-packages/setuptools-69.0.3.dist-info/METADATA

So we can believe that the RPM package is the “real” package because the RPM owns the files that are evidence of the PyPI package.

While I’m here, this is the Dockerfile I’m using to test this:

FROM fedora:latest
RUN yum install -y python3-setuptools

I’ll answer your other questions in my next post; I just wanted to explain how overlap by file ownership is detected.

We are working on adding dependency relationships to different catalogers, but the relationship between the RPM and the PyPI package isn’t a dependency; it’s an overlap.

You can track more information about our efforts to add dependency information to additional catalogers at Add support for package dependency relationships · Issue #572 · anchore/syft · GitHub. We have implemented dependency detection in the “poetry.lock” cataloger and in the “egg/wheel metadata cataloger”. In some cases, it can be difficult in the Python ecosystem to get more dependency information. For example, “requirements.txt” doesn’t say why anything is in the file; it’s not possible to reconstruct the dependency graph just by looking at a requirements.txt file.

Back to de-duplication: in general, de-duplication of packages can’t be done by the dependency graph, because if package A depends on package B, they are definitionally not the same package. However, if package A owns the files that are evidence of package B, then they might be “duplicates” in some sense of the word.
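To generalize that idea beyond one named pair, here is a rough Python sketch that scans a Syft-JSON-shaped document for every case where an OS package owns files that are evidence of another package. It assumes ownership appears as "contains" relationships and evidence as "evident-by" relationships, as in the examples in this thread. The find_ownership_overlaps helper and the short IDs in the sample data are hypothetical, and this is not how Syft or Grype implement it.

```python
from collections import defaultdict


def find_ownership_overlaps(sbom, os_types=("rpm",)):
    """Return (os_package, other_package, shared_paths) for overlapping packages."""
    path_by_id = {f["id"]: f["location"]["path"] for f in sbom.get("files", [])}
    owned = defaultdict(set)     # package ID -> paths it contains
    evidence = defaultdict(set)  # package ID -> paths it is evident by
    for rel in sbom.get("artifactRelationships", []):
        path = path_by_id.get(rel["child"])
        if path is None:
            continue  # the child is another package, not a file
        if rel["type"] == "contains":
            owned[rel["parent"]].add(path)
        elif rel["type"] == "evident-by":
            evidence[rel["parent"]].add(path)

    artifacts = sbom.get("artifacts", [])
    os_pkgs = [a for a in artifacts if a["type"] in os_types]
    others = [a for a in artifacts if a["type"] not in os_types]
    overlaps = []
    for os_pkg in os_pkgs:
        for other in others:
            shared = owned[os_pkg["id"]] & evidence[other["id"]]
            if shared:
                overlaps.append((os_pkg["name"], other["name"], sorted(shared)))
    return overlaps


# Made-up sample data: "requests" stands in for a package no RPM owns.
meta = "/usr/lib/python3.12/site-packages/setuptools-69.0.3.dist-info/METADATA"
sample = {
    "artifacts": [
        {"id": "rpm1", "name": "python3-setuptools", "type": "rpm"},
        {"id": "py1", "name": "setuptools", "type": "python"},
        {"id": "py2", "name": "requests", "type": "python"},
    ],
    "files": [
        {"id": "f1", "location": {"path": meta}},
        {"id": "f2", "location": {"path": "/opt/app/requests.dist-info/METADATA"}},
        {"id": "f3", "location": {"path": "/var/lib/rpm/Packages"}},
    ],
    "artifactRelationships": [
        {"parent": "rpm1", "child": "f1", "type": "contains"},
        {"parent": "rpm1", "child": "f3", "type": "evident-by"},
        {"parent": "py1", "child": "f1", "type": "evident-by"},
        {"parent": "py2", "child": "f2", "type": "evident-by"},
    ],
}

overlaps = find_ownership_overlaps(sample)
for os_name, other_name, paths in overlaps:
    print(os_name, "owns evidence of", other_name, paths)
```

In the sample, only setuptools is reported as overlapping; the requests entry is not owned by any RPM and so would not be a de-duplication candidate.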

This depends on the distro. For example, many Linux distros only publish a CVE in their vulnerability feed when they release a fix, so vulnerabilities that are not fixed yet or won’t be fixed by the distro will be missed if we don’t include the Python package as well. Additionally, RPMs that don’t originate from the distro might pull in vulnerable Python packages, for example.

We do perform some de-duplication in Grype when there is an overlap by file ownership and we believe the distro’s vulnerability data is comprehensive. The list of distros that this de-duplication is currently done for is here.
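Roughly, the decision has the following shape. This is only a sketch of the policy as described in this thread, not Grype's code: the function name is hypothetical, and the distro names are placeholders rather than the actual list.

```python
def should_prefer_os_package(distro_id, evidence_owned_by_os_package, comprehensive_distros):
    """Drop the language package only when both conditions from the thread hold."""
    return evidence_owned_by_os_package and distro_id in comprehensive_distros


# Placeholder names, not real distro identifiers and not Grype's list.
comprehensive = {"distro-with-comprehensive-feed"}

# The OS package owns the evidence and the feed is considered comprehensive.
print(should_prefer_os_package("distro-with-comprehensive-feed", True, comprehensive))

# Same overlap, but the feed only lists vulnerabilities once they are fixed,
# so the language package is kept to avoid false negatives.
print(should_prefer_os_package("distro-with-fix-only-feed", True, comprehensive))

# No ownership overlap: nothing to de-duplicate.
print(should_prefer_os_package("distro-with-comprehensive-feed", False, comprehensive))
```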

In short, we can probably detect that the RPM and the PyPI “setuptools” in this image are the “same” in some sense of the word today, but we don’t de-duplicate for 2 reasons:

  1. The Python package has some metadata, and was really found on the system, even if it got there by someone installing an RPM
  2. Different vulnerability feeds have different completeness. In particular, GHSA is a good source of vulnerability data for PyPI, so including the package can prevent false negatives.

In the case where the distro has back-ported a security fix to a version of the Python package that isn’t fixed on PyPI, this can lead to a false positive. We have an issue open to track this situation in Grype and remove a “vulnerable” ruling from GHSA data in favor of a “fixed” ruling from the distro feed: RFC: Explicit reporting of negative matches · Issue #1426 · anchore/grype · GitHub

I hope these long replies are helpful! Please let me know if you still have questions!