A question for you on the topic of binary fingerprinting like the recent ffmpeg one.
tl;dr: Some container ecosystems share numerous, common “unknown unknowns” in their SBOMs. Would it be advantageous to target those ecosystems to get a good ‘bang for our buck’ in terms of identifying more files than we currently do?
Take, for example, these people, who have tons of containers popular with self-hosters, shipping desirable things to nerds like Plex, sabnzbd, and many others.
They have a load of unknowns (files that syft cannot identify) that are common across all their containers.
For example, something called ‘s6’, which is a process management tool. It’s actively maintained, so there are different versions across these containers, as they’re built at different times.
It’s provided as a source tarball, which the containers obtain via a binary build in another container overlay.
If we had a fuzzy binary cataloger for s6 (and its subprojects skalibs and execline), all of which are simple ./configure, make, make install builds and likely (I presume) don’t ‘hide’ or masquerade their real version number in the binaries, we could knock out those “unknowns” across that entire ecosystem.
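To make that concrete, here’s a minimal sketch of the kind of fuzzy match I mean (not syft’s actual classifier API): scan a binary for an embedded version string and report it. The regex and the idea of pointing it at something like s6-svscan are my assumptions, resting on the presumption above that s6 doesn’t hide its version in the build output.

```go
// Sketch of a fuzzy binary identifier for s6-style binaries.
// Assumes the binary embeds a plain version string such as "s6 2.12.0.x";
// the pattern below is illustrative, not verified against real s6 builds.
package main

import (
	"fmt"
	"os"
	"regexp"
)

// Hypothetical pattern; a real classifier would be derived from
// inspecting actual s6/skalibs/execline binaries across versions.
var s6VersionPattern = regexp.MustCompile(`s6[- ](\d+\.\d+\.\d+(?:\.\d+)?)`)

// identify reads a binary and looks for an embedded s6 version string.
func identify(path string) (string, bool) {
	data, err := os.ReadFile(path)
	if err != nil {
		return "", false
	}
	if m := s6VersionPattern.FindSubmatch(data); m != nil {
		return string(m[1]), true
	}
	return "", false
}

func main() {
	// e.g. pass /usr/bin/s6-svscan from one of the containers in question
	for _, path := range os.Args[1:] {
		if version, ok := identify(path); ok {
			fmt.Printf("%s: s6 %s\n", path, version)
		} else {
			fmt.Printf("%s: unknown\n", path)
		}
	}
}
```

The same approach would presumably extend to skalibs and execline with their own patterns, since they’re built the same way.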
Now, that’s one ecosystem (fairly popular as it is).
Is it worth identifying unknowns in very popular ecosystems, and targeting them, if it objectively makes syft (and friends) better at identifying software in containers?
Other potential places to look are MCPs and LLMs, which frequently consume similar content that’s new and not packaged anywhere, so it’s hard to identify.