Reducing 'unknowns' via targeted fuzzy binary catalogers

A question for you on the topic of binary fingerprinting like the recent ffmpeg one.

tl;dr: Some container ecosystems have numerous, common “unknown unknowns” in their SBOMs. Would it be advantageous to target those ecosystems to get good ‘bang for our buck’ in terms of identifying more files than we currently do?

Take, for example, the linuxserver folks (docker.io/linuxserver), who publish tons of containers popular with self-hosters, shipping things nerds want, like Plex, sabnzbd, and many others.

They have a load of unknowns (files that syft cannot identify) that are common across all their containers.

For example, something called ‘s6’, which is a process management tool. It’s actively maintained, so there are different versions on these containers, as they’re built at different times.

It’s provided as a source tarball, which the containers obtain via a binary build in another container overlay.

If we had a fuzzy binary cataloger for s6 and its subprojects, skalibs and execline (all of which build with a simple ./configure, make, make install, and, I presume, don’t ‘hide’ or masquerade their real version number in the binaries), we could knock out those “unknowns” across that entire ecosystem.

Now, that’s one ecosystem (fairly popular as it is).

Is it worth identifying unknowns in very popular ecosystems, and targeting them, if it objectively makes syft (and friends) better at identifying software in containers?
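To make "which unknowns are worth targeting" a bit more concrete, here is a rough sketch of how the most common unknowns across a set of images could be ranked. It assumes each SBOM is syft JSON output already loaded into a dict, and that unidentified files appear under `files` with an `unknowns` entry and a `location.path`, as in the jq filter later in this thread. The function names are hypothetical, not part of syft.

```python
from collections import Counter
from pathlib import PurePosixPath


def unknown_paths(sbom: dict) -> list[str]:
    """Return the paths of files flagged with 'unknowns' in one SBOM.

    Assumed layout (inferred from the jq filter in this thread):
    sbom["files"][i]["location"]["path"] and sbom["files"][i]["unknowns"].
    """
    paths = []
    for entry in sbom.get("files", []):
        path = entry.get("location", {}).get("path")
        if entry.get("unknowns") and path:
            paths.append(path)
    return paths


def rank_unknowns(sboms: list[dict]) -> list[tuple[str, int]]:
    """Count, per file name, how many SBOMs contain it as an unknown."""
    counts: Counter = Counter()
    for sbom in sboms:
        # Use a set so a file name counts once per image, not once per copy.
        names = {PurePosixPath(p).name for p in unknown_paths(sbom)}
        counts.update(names)
    return counts.most_common()
```

Feeding it the output of `syft <image> -o json` for every image in an ecosystem (loaded with `json.load`) would give a list of file names ordered by how many images they are unknown in, which is one way to pick cataloger targets.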

Other potential places to look are MCPs and LLMs which frequently consume similar content that’s new, and not packaged anywhere, so hard to identify.

Well, my morning rabbit-hole was discovering that the example I chose in this thread was a bad one. The author explicitly doesn’t put the version number in the binary.

From: Laurent Bercot <ska-skaware_at_skarnet.org>
Date: Sat, 13 Jan 2024 06:04:57 +0000
>there is no version information option (like say “-V”) for
>the s6 utils. such a command line option should make the
>tool output its version number and terminate.
>
>it would be nice if such an option could be added to the tools.

It would also add boilerplate to every single binary, which would make
them bigger, as well as longer and more annoying to write.

Most of the time, the version information is available elsewhere;
typically, in your package manager. Or in the filesystem if you’re using
slashpackage. (That’s one of the issues with FHS, it requires an
additional system such as a package manager to retain the meta-information
it loses by having binaries in fixed directories.)

Binaries are not the place to store meta-information. There is nothing
you can programmatically do with version information; if you require a
specific minimal version of a tool, then by policy you should have it on
your system, and you should assume that your requirements are met (and
it’s a bug if they are not).

“true --help” and “true --version” are often mentioned for laughs; there
is a reason for that.

– Laurent

They seem quite fixed in their stance. Would it be satisfactory to use the path to identify the release? Or is that far too fuzzy?

In the example containers I mentioned, all the s6* binaries live in a path which contains the version:

  "path": "/package/admin/s6-rc-0.5.6.0/command/s6-rc",
  "path": "/package/net/s6-networking-2.7.1.0/command/s6-ucspitlsd",
  "path": "/package/web/s6-dns-2.4.1.0/command/s6-dns-hosts-compile",

The above taken from:

syft docker.io/linuxserver/sabnzbd:version-4.5.3 -o json | jq '.files[]|select(.unknowns)|{location,unknowns}' | grep path
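
For what it's worth, the path-based idea could look something like the sketch below. It assumes the layout is always `/package/<category>/<name>-<version>/command/<binary>`, which I've only inferred from the three paths above, and that versions are purely dotted numbers. The names here are hypothetical and nothing in it is syft's actual cataloger API.

```python
import re
from typing import NamedTuple, Optional

# Assumed slashpackage layout, inferred from the example paths only:
#   /package/<category>/<name>-<version>/command/<binary>
SLASHPACKAGE_RE = re.compile(
    r"^/package/(?P<category>[^/]+)/"
    r"(?P<name>[^/]+)-(?P<version>\d+(?:\.\d+)*)/"
    r"command/(?P<binary>[^/]+)$"
)


class PathMatch(NamedTuple):
    name: str
    version: str
    binary: str


def parse_slashpackage_path(path: str) -> Optional[PathMatch]:
    """Extract package name and version from a slashpackage-style path.

    Returns None when the path does not follow the assumed layout.
    """
    m = SLASHPACKAGE_RE.match(path)
    if m is None:
        return None
    return PathMatch(m["name"], m["version"], m["binary"])
```

The obvious weakness is that the version comes from a directory name rather than the binary, so anyone who renames or copies the directory gets a wrong answer, and a binary installed under a plain FHS path such as /usr/bin gets no answer at all.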