Is there a reason that add-snippet searches by the version we expect to find? Shouldn’t it instead search by the regex from Syft source code?
It’s a bit of a chicken-and-egg problem. When you first create a binary classifier, you don’t know what the expression is going to be. You expect to find a version string matching a specific version, but beyond that you probably have no idea about NULL characters or other strings located nearby that might help you find the right thing. What you really want is to test regexes against a binary file until you find the right incantation. The problem is that none of the tools are set up for exactly this loop. You also don’t want to run Syft repeatedly against a potentially large image that’s slow to catalog before you can even test that your regex works. So I think the idea is: capture a snippet and write tests against it so you can iterate on the expression quickly. To get that initial snippet, though, you don’t have an expression, only a version, and that’s where development of the tool stopped.
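To make the "iterate against a snippet" loop concrete, here is a minimal sketch in Go using only the standard library. It is not Syft's code: the snippet bytes, the `example-tool` name, and the `findVersion` helper are all invented for illustration. It shows why the surrounding bytes matter: a loose version pattern picks up a decoy string, while a pattern anchored on a nearby string and the NULL terminators finds the intended one.

```go
package main

import (
	"fmt"
	"regexp"
)

// snippet stands in for a few bytes captured from a binary around the
// version string. The contents are made up for this example.
var snippet = []byte("\x00libfoo 9.9.9\x00example-tool\x001.2.3\x00")

// findVersion (hypothetical helper) returns the "version" capture group of
// the first match of pattern in data, or "" if nothing matches.
func findVersion(pattern string, data []byte) string {
	re := regexp.MustCompile(pattern)
	idx := re.SubexpIndex("version")
	m := re.FindSubmatch(data)
	if m == nil || idx < 0 {
		return ""
	}
	return string(m[idx])
}

func main() {
	candidates := []string{
		// Loose: matches the first thing that looks like a version.
		`(?P<version>[0-9]+\.[0-9]+\.[0-9]+)`,
		// Anchored on a neighboring string and the NULL terminators.
		`example-tool\x00(?P<version>[0-9]+\.[0-9]+\.[0-9]+)\x00`,
	}
	for _, p := range candidates {
		fmt.Printf("%q -> %q\n", p, findVersion(p, snippet))
	}
}
```

Because the snippet is tiny and lives in memory, each candidate expression can be tried in milliseconds instead of re-cataloging an image.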
I would point out that when I wrote a blog post about this, I had a lot of similar questions and opened an issue to see about making this better.
But the problem remained: I wanted a way to essentially load an image and test a bunch of iterations of glob patterns and regexes. So, as one of the experiments, I created an inspect subcommand in Syft that allows you to load an image and keep it in memory, then search with regexes and glob patterns to make sure you’re finding the right things. But the problem doesn’t stop there: when you modify an expression, different versions may have unexpected results, so this command allows you to load multiple images and test the regexes against all of them at the same time to make sure you’re not regressing on what you find. The idea is then to take the known test images that we have configured and perhaps automate more of this, to make iterating on expressions as fast and as useful a loop as it can be.
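The regression check across several images can be sketched roughly like this. This is an assumption-laden illustration, not the inspect subcommand itself: the `fixture` type, the `regressions` function, and the fixture contents are hypothetical, and plain byte slices stand in for files pulled from real images.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
)

// fixture (hypothetical) pairs in-memory content with the version we
// expect the expression to extract from it.
type fixture struct {
	data []byte
	want string
}

// regressions runs pattern over every fixture and returns the sorted names
// of fixtures where the extracted version differs from the expected one.
func regressions(pattern string, fixtures map[string]fixture) []string {
	re := regexp.MustCompile(pattern)
	idx := re.SubexpIndex("version")
	var failed []string
	for name, f := range fixtures {
		got := ""
		if m := re.FindSubmatch(f.data); m != nil && idx >= 0 {
			got = string(m[idx])
		}
		if got != f.want {
			failed = append(failed, name)
		}
	}
	sort.Strings(failed)
	return failed
}

// fixtures holds invented content: the older build prefixes the version
// with "v", the newer one does not.
var fixtures = map[string]fixture{
	"tool-1.2.3": {data: []byte("tool\x00v1.2.3\x00"), want: "1.2.3"},
	"tool-2.0.0": {data: []byte("tool\x002.0.0\x00"), want: "2.0.0"},
}

func main() {
	strict := `tool\x00v(?P<version>[0-9]+\.[0-9]+\.[0-9]+)\x00`
	relaxed := `tool\x00v?(?P<version>[0-9]+\.[0-9]+\.[0-9]+)\x00`
	fmt.Println("strict fails on: ", regressions(strict, fixtures))
	fmt.Println("relaxed fails on:", regressions(relaxed, fixtures))
}
```

Keeping every fixture loaded means one edit to the expression is checked against all known versions at once, which is the point of loading multiple images together.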
So, the next question is: why shouldn’t it use the source code? I don’t know! Another aspect of the binary classifier I have an issue with is the fact that everything is defined in code (yes, I know, I wrote a lot of this to begin with, but it’s had a lot of contributions and has grown beyond the original scope). So, I also created a dynamic YAML-based cataloger that allows us to define each binary classifier in a separate YAML document. Why does this matter? Because it makes testing each individual cataloger easier and enables dynamically loading configurations, for example in the inspect tool, which is a lot easier than parsing specific things out of Go code or directly executing the cataloger.
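As a rough sketch of what "classifier as data" buys you, the example below loads a definition from a document and applies it without any classifier-specific Go code. Several things here are assumptions: the schema (`name`, `glob`, `pattern`) is invented and is not the format of the YAML cataloger mentioned above, and JSON from the standard library stands in for YAML purely to keep the example dependency-free.

```go
package main

import (
	"encoding/json"
	"fmt"
	"path"
	"regexp"
)

// classifierDef is a hypothetical schema for a data-defined classifier.
type classifierDef struct {
	Name    string `json:"name"`
	Glob    string `json:"glob"`    // matched against the file's base name
	Pattern string `json:"pattern"` // must contain a "version" capture group
}

// doc is an invented definition; JSON is used here in place of YAML.
const doc = `{
  "name": "example-tool-binary",
  "glob": "example-tool*",
  "pattern": "example-tool\\x00(?P<version>[0-9]+\\.[0-9]+\\.[0-9]+)\\x00"
}`

// loadClassifier parses one definition and checks that its pattern compiles.
func loadClassifier(raw []byte) (classifierDef, *regexp.Regexp, error) {
	var def classifierDef
	if err := json.Unmarshal(raw, &def); err != nil {
		return def, nil, err
	}
	re, err := regexp.Compile(def.Pattern)
	if err != nil {
		return def, nil, err
	}
	return def, re, nil
}

// classify returns the extracted version when the file name matches the
// glob and the pattern matches the content.
func classify(def classifierDef, re *regexp.Regexp, filePath string, content []byte) (string, bool) {
	ok, err := path.Match(def.Glob, path.Base(filePath))
	if err != nil || !ok {
		return "", false
	}
	idx := re.SubexpIndex("version")
	m := re.FindSubmatch(content)
	if m == nil || idx < 0 {
		return "", false
	}
	return string(m[idx]), true
}

func main() {
	def, re, err := loadClassifier([]byte(doc))
	if err != nil {
		panic(err)
	}
	content := []byte("\x00example-tool\x001.2.3\x00")
	version, found := classify(def, re, "/usr/bin/example-tool", content)
	fmt.Println(def.Name, version, found)
}
```

With definitions in separate documents, a tool can load, tweak, and re-run a single classifier in isolation instead of rebuilding or executing the whole cataloger.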
Sorry, quite a tangent, but I just wanted to mention these few things that we have code for already that one day might be good to move forward with.