How to choose the right binary snippet when adding/improving a binary classifier?

I’m looking to add a binary snippet as part of reviewing a PR, and I have:

❯ make add-snippet redis-server@7.2.5
go run ./manager add-snippet
running: ./capture-snippet.sh classifiers/bin/redis-server/7.2.5/linux-386/redis-server 7.2.5 --search-for 7\.2\.5 --group redis-server --length 100 --prefix-length 20
Using binary file:      classifiers/bin/redis-server/7.2.5/linux-386/redis-server
Searching for pattern:  7\.2\.5
Capture length:         120 bytes
Capture prefix length:  20 bytes
Multiple string matches found in the binary:

1) 7.2.5
2) 7.2.5buildkitsandbox-1725496002000000000

How do I know whether to choose 1 or 2? How would a contributor know whether to choose 1 or 2?

The PR is anchore/syft Pull Request #3281 (“update redis classifier” by witchcraze) if folks are interested, but this thread is more of a general question.

There is no one-size-fits-all answer to this question. It’s up to the author to decide what the correct answer is in the context of the binary they chose – it also might be that both are valid!

In this case I take a look at the binary in question:

❯ docker run --rm -it redis:latest bash

root@b5176419c89d:/data# find / | grep redis-server
/usr/local/bin/redis-server

root@b5176419c89d:/data# /usr/local/bin/redis-server --version
Redis server v=7.4.0 sha=00000000:0 malloc=jemalloc-5.3.0 bits=64 build=574a105f30831ada

Here when I match this binary I’d probably expect 7.4.0 to be the right answer, as this is what is reported by --version.

Also, it’s best to think about vuln matching downstream: do redis versions tend to have buildkitsandbox-### suffixes? A quick look at CVE info shows they do not.

Ahh I see. I misunderstood what was going on completely. I thought I was choosing a snippet of the file to add to version control, because I ran add-snippet. I didn’t understand that I was looking at only the matching subset of the snippets.

What the script is telling me is: “Here are N strings from the binary that match your regex; which of them shall I add with 120 bytes of context?” and what I thought it was telling me was, “Here are N snippets of the binary I might add to version control; which is right?”
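To make that concrete, here is a minimal Python sketch of the windowing step. It is inferred only from the output above (a 20-byte prefix plus 100 bytes gives the 120 captured bytes), not from capture-snippet.sh itself; the function name `capture_windows` and the toy `blob` are made up for illustration.

```python
def capture_windows(data: bytes, needle: bytes, length: int = 100, prefix_length: int = 20):
    """Return (start_offset, window) for every occurrence of needle in data.

    Each window begins prefix_length bytes before the match and ends
    length bytes after the start of the match (assumed behaviour).
    """
    windows = []
    pos = data.find(needle)
    while pos != -1:
        start = max(0, pos - prefix_length)
        windows.append((start, data[start:pos + length]))
        pos = data.find(needle, pos + 1)
    return windows


# Toy stand-in for a binary: padding, a NUL-delimited version, more padding.
blob = b"A" * 30 + b"\x007.2.5\x00standalone\x00" + b"B" * 200

windows = capture_windows(blob, b"7.2.5")
start, window = windows[0]
print(hex(start), len(window))
```

Each entry in the numbered menu corresponds to one such window; picking a number only decides which window gets saved.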

Is it concerning that there are two snippets in the binary matching the regex, and that one of them is wrong? Should we tune the regex till there’s exactly one snippet that matches the correct version?

This is certainly the most challenging aspect of adding things to the binary cataloger, in my opinion. I often want some way to “easily” search a binary and get information, especially including null terminators and testing regexes. I have a number of experiments I’ve spent some time with over the past 6 months or so for doing this with varying complexity (both with CLI tools and web UIs). I’d love to spend some time to make this a much nicer experience; happy to provide many more details if this is something people are interested in at all.

Is it concerning that there are two snippets in the binary matching the regex, and that one of them is wrong? Should we tune the regex till there’s exactly one snippet that matches the correct version?

It isn’t concerning here because all it’s searching for is the specific version string – it’s not using a regex.

Thanks @kzantow that makes a lot of sense.

Here’s the full decision info available when running add-snippet:

❯ make add-snippet name=redis-server@7.2.5
go run ./manager add-snippet
running: ./capture-snippet.sh classifiers/bin/redis-server/7.2.5/linux-386/redis-server 7.2.5 --search-for 7\.2\.5 --group redis-server --length 100 --prefix-length 20
Using binary file:      classifiers/bin/redis-server/7.2.5/linux-386/redis-server
Searching for pattern:  7\.2\.5
Capture length:         120 bytes
Capture prefix length:  20 bytes
Multiple string matches found in the binary:

1) 7.2.5
2) 7.2.5buildkitsandbox-1725496002000000000

Please select a match: 1

00286bde: 6c6c 6f63 2d35 2e33 2e30 0030 3030 3030  lloc-5.3.0.00000
00286bee: 3030 3000 372e 322e 3500 7374 616e 6461  000.7.2.5.standa
00286bfe: 6c6f 6e65 0052 756e 6e69 6e67 206d 6f64  lone.Running mod
00286c0e: 653d 2573 2c20 706f 7274 3d25 642e 0059  e=%s, port=%d..Y
00286c1e: 6f75 2069 6e73 6973 742e 2e2e 2065 7869  ou insist... exi
00286c2e: 7469 6e67 206e 6f77 2e00 7365 7276 6572  ting now..server
00286c3e: 2e72 6570 6c5f 6261 636b 6c6f 6700 6e65  .repl_backlog.ne
00286c4e: 776c 656e 203e 206c                      wlen > l

Does this snippet capture what you need? (Y/n/q) n
Multiple string matches found in the binary:

1) 7.2.5
2) 7.2.5buildkitsandbox-1725496002000000000

Please select a match: 2

002881fc: 796f 7520 7265 616c 6c79 2077 616e 743f  you really want?
0028820c: 0000 0000 372e 322e 3562 7569 6c64 6b69  ....7.2.5buildki
0028821c: 7473 616e 6462 6f78 2d31 3732 3534 3936  tsandbox-1725496
0028822c: 3030 3230 3030 3030 3030 3030 0000 0000  002000000000....
0028823c: 2320 5365 7276 6572 0d0a 7265 6469 735f  # Server..redis_
0028824c: 7665 7273 696f 6e3a 2573 0d0a 7265 6469  version:%s..redi
0028825c: 735f 6769 745f 7368 6131 3a25 730d 0a72  s_git_sha1:%s..r
0028826c: 6564 6973 5f67 6974                      edis_git

Does this snippet capture what you need? (Y/n/q)

What I’m really doing here is looking at these two xxd dumps and asking, “Which one looks like the part of the binary with version info in it,” correct? Does it matter that the regex /7\.2\.5/ matches the wrong string for the second one? The second one looks more like version info to me.

The thing to look at is the hex on the left: 00 372e 322e 3500. This is [NULL]7.2.5[NULL]. These null-terminated strings are very frequently the version string you want. However, there are also often a number of things that match simple regexes of [NULL]<version>[NULL] format, so ideally there is something either before or after containing something related to the package you’re looking for to help pick the right one. Does the second string with buildkit in it match versions you see published from the project or elsewhere? If so it’s the right thing. If not, it’s probably the first one.
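As a rough illustration of that check, the Python sketch below searches for [NULL]<version>[NULL] in the bytes of the first dump. The bytes are transcribed by hand from the hex above; the pattern and the helper name `nul_delimited_versions` are assumptions for this example, not what Syft uses.

```python
import re

# Start of the first hex dump in this thread (offset 0x286bde), transcribed by hand.
SNIPPET = (
    b"lloc-5.3.0\x00"
    b"00000000\x00"
    b"7.2.5\x00"
    b"standalone\x00"
    b"Running mode=%s, port=%d.\x00"
)

# A dotted three-part version with a NUL byte on both sides.
NUL_DELIMITED = re.compile(rb"\x00(\d+\.\d+\.\d+)\x00")


def nul_delimited_versions(data: bytes):
    """Return (offset, version) for every NUL-delimited version in data."""
    return [
        (m.start(1), m.group(1).decode("ascii"))
        for m in NUL_DELIMITED.finditer(data)
    ]


print(nul_delimited_versions(SNIPPET))
```

Note that the jemalloc "5.3.0" is skipped because it has a "-" on its left rather than a NUL, which is exactly the kind of nearby decoy the surrounding strings help you rule out.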

@kzantow but the 7.2.5 isn’t delimited by NULL in the regex being added in anchore/syft Pull Request #3281 (“update redis classifier” by witchcraze).

The actual version number is delimited by NULL on the left but [a-z0-9]{12,15}-[0-9]{19} on the right. In other words, I think the new regex doesn’t match the first snippet, and I should use the second.
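A quick way to check this claim is to run a pattern of that shape against both dumps. The Python sketch below uses a pattern reconstructed from the description above (NULL on the left, [a-z0-9]{12,15}-[0-9]{19} on the right); it is not copied from the PR, and the sample bytes are transcribed by hand from the two hex dumps.

```python
import re

# Reconstructed from the description in this thread, NOT the PR's actual regex.
PATTERN = re.compile(rb"\x00(?P<version>\d+\.\d+\.\d+)[a-z0-9]{12,15}-[0-9]{19}")

# Bytes around match 1 (plain NUL-terminated version).
FIRST = b"00000000\x007.2.5\x00standalone\x00Running mode=%s, port=%d.\x00"

# Bytes around match 2 (version followed directly by the build id).
SECOND = (
    b"you really want?\x00\x00\x00\x00"
    b"7.2.5buildkitsandbox-1725496002000000000\x00\x00\x00\x00"
    b"# Server\r\nredis_version:%s\r\n"
)


def extract_version(data: bytes):
    """Return the captured version as str, or None if the pattern does not match."""
    m = PATTERN.search(data)
    return m.group("version").decode("ascii") if m else None


print("first snippet: ", extract_version(FIRST))
print("second snippet:", extract_version(SECOND))
```

Under these assumptions only the second snippet yields a version, so a test fixture built from the first snippet would not exercise a regex of this shape.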

Is there a reason that add-snippet searches by the version we expect to find? Shouldn’t it instead search by the regex from Syft source code?

It’s a bit of a chicken-and-egg problem. When you first create a binary classifier, you don’t know what the expression is going to be. You expect to find a version string matching a specific version, but beyond that you probably have no idea about NULL characters or other strings colocated nearby that might be useful to find the right thing. And really, what you want to do is just test regexes against a binary file until you find the right incantation. The problem is that none of the tools are set up for exactly this loop. And you don’t want to run Syft repeatedly against a potentially large image that’s slow to catalog before you can even test that your regex works, so I think the idea is: capture a snippet and write tests against it so you can iterate on the expression quickly. To get that initial snippet, though, you don’t have an expression – only a version, and that’s where development of the tool stopped.
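The "capture a snippet, then iterate on the expression" loop could look roughly like the Python sketch below. It is a stand-in for illustration, not Syft's test harness: the helper `try_patterns` and both candidate patterns are made up, and the snippet bytes are transcribed from the first hex dump in this thread.

```python
import re

# Captured snippet (transcribed from the first hex dump in this thread).
SNIPPET = b"lloc-5.3.0\x0000000000\x007.2.5\x00standalone\x00"

# Hypothetical candidate expressions to compare.
CANDIDATES = {
    "bare": rb"(?P<version>\d+\.\d+\.\d+)",
    "nul-delimited": rb"\x00(?P<version>\d+\.\d+\.\d+)\x00",
}


def try_patterns(candidates, snippet: bytes):
    """Map each candidate name to the list of versions it extracts from snippet."""
    return {
        name: [m.group("version").decode("ascii") for m in re.finditer(pattern, snippet)]
        for name, pattern in candidates.items()
    }


results = try_patterns(CANDIDATES, SNIPPET)
for name, versions in results.items():
    verdict = "ok" if versions == ["7.2.5"] else "ambiguous or wrong"
    print(f"{name}: {versions} ({verdict})")
```

Because the snippet is only a hundred-odd bytes, each iteration is instant, which is the whole point of capturing it instead of re-cataloging an image.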

I would point out that when I wrote a blog post about this, I had a lot of similar questions and opened an issue to see about making this better.

But the problem remained – I wanted a way to essentially load an image and test a bunch of iterations of glob patterns and regexes. So, in one of the experiments, I created an inspect subcommand in Syft that allows you to load an image but keep it in memory, then search with regexes and glob patterns to make sure you’re finding the right things. But the problem doesn’t stop there – when you modify an expression, different versions may have unexpected results, so this command allows you to load multiple images and test the regexes against all of them at the same time to make sure you’re not regressing on what gets found. The idea is to then take known test images that we have configured and just automate more of this, perhaps, to make iterating on expressions as fast and as useful a loop as it can be.
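The multi-version regression idea can be sketched in a few lines of Python. This is not the inspect subcommand, just an illustration of the check it performs; `check_pattern` is a made-up helper, and the "sample-b" bytes are an invented layout for a build that stores a plain NUL-terminated version.

```python
import re

# name -> (bytes, expected version). "sample-a" mirrors the second hex dump in
# this thread; "sample-b" is an INVENTED layout used only to show a regression.
SAMPLES = {
    "sample-a": (b"\x007.2.5buildkitsandbox-1725496002000000000\x00", "7.2.5"),
    "sample-b": (b"\x006.0.16\x00standalone\x00", "6.0.16"),
}

STRICT = rb"\x00(?P<version>\d+\.\d+\.\d+)[a-z0-9]{12,15}-[0-9]{19}"
RELAXED = rb"\x00(?P<version>\d+\.\d+\.\d+)(?:\x00|[a-z0-9]{12,15}-[0-9]{19})"


def check_pattern(pattern: bytes, samples):
    """Return {sample name: what was found} for every sample that regressed."""
    compiled = re.compile(pattern)
    failures = {}
    for name, (data, expected) in samples.items():
        m = compiled.search(data)
        got = m.group("version").decode("ascii") if m else None
        if got != expected:
            failures[name] = got
    return failures


print("strict: ", check_pattern(STRICT, SAMPLES))
print("relaxed:", check_pattern(RELAXED, SAMPLES))
```

An empty result means the expression still finds the expected version everywhere; anything else names the samples that a change to the expression broke.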

So, the next question is: why shouldn’t it use the source code? I don’t know! Another aspect of the binary classifier I have an issue with is the fact that everything is defined in code (yes, I know, I wrote a lot of this to begin with, but it’s had a lot of contributions and has grown beyond the original scope). So, I also created a dynamic YAML-based cataloger that allows us to define each binary classifier in a separate YAML document. Why does this matter? Because it makes testing each individual cataloger easier and enables dynamically loading configurations, for example in the inspect tool, which is a lot easier than parsing specific things out of Go code or directly executing the cataloger.

Sorry, quite a tangent, but I just wanted to mention these few things that we have code for already that one day might be good to move forward with.