Enabling Syft network features from the CLI

For a long time Syft was purely a static analysis tool and did not fetch data from the network to determine package information to be included. This has been changing over time, and as of today Syft is able to search for licenses from remote sources for JavaScript, Golang, and Maven along with the ability to resolve additional required information from these online sources. Because of this, we all thought it was a good time to circle back to the original idea of having a multi-level configuration to enable all features together with a single flag.

With our CLI tools, we generally try to make easy-to-comprehend flags with as little ambiguity as possible and keep the terms short, only deviating from these tenets when necessary.

I created a PR to add a --use-network flag, but after some reflection and after starting to port the same behavior to Grype, I realized that a network flag strictly for cataloging is a bit wrong, since there are other network features such as checking for application updates or downloading a Grype database. So, should these functions also fall under the same flag? And what about customizing the Syft scan based on what a user is most interested in: let’s say the user scans a lot of Java AND JavaScript, but only cares about enabling Maven lookups to improve the Java quality and doesn’t want the performance penalty of looking up remote licenses for the JavaScript packages?

After some team discussion, we narrowed the choices for a boolean flag down to two: --network, which would also apply to other network features, or --remote-enrichment, which would apply only to the network features that search remote sources to enrich the package data with information not otherwise present, such as licenses.

But neither of these boolean-only options allows for specifying individual elements. Those would still have to be specified with environment variables or config files, with things like SYFT_GOLANG_SEARCH_REMOTE_LICENSES=false to disable a single specific thing a user might want to disable.

Another approach is to follow the pattern we established with the catalogers to some degree: allowing multiple --network flags to be specified, with some modifier directives in the case a user wants to disable these, for example:

syft --network

could mean: enable all network features, or:

syft --network=all,-golang

could mean: enable all network features except golang. This has some drawbacks: these names don’t necessarily match cataloger names and would only match, in a sort-of ad-hoc way, the names we’ve used for the cataloger categories of the same types, and we might want to support aliases, like golang and go.

And a final example:

grype --network=none,db

could mean: do no network operations except related to the database.
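
To make the directive idea concrete, here is a minimal sketch of how a comma-separated value such as all,-golang or none,db could be evaluated left to right. This is not how Syft or Grype implement it; the feature names, the alias table, and the function name resolve_network are all hypothetical and only illustrate the proposal.

```python
# Hypothetical feature names and aliases, chosen for illustration only;
# they are not taken from the Syft or Grype code.
FEATURES = {"golang", "java", "javascript", "db", "update-check"}
ALIASES = {"go": "golang", "js": "javascript"}


def resolve_network(spec):
    """Evaluate a comma-separated directive list from left to right.

    "all" and "none" reset the whole set; "name" enables one feature and
    "-name" disables it. Aliases are normalized before the lookup.
    """
    enabled = set()
    for token in spec.split(","):
        token = token.strip().lower()
        if not token:
            continue
        if token == "all":
            enabled = set(FEATURES)
        elif token == "none":
            enabled = set()
        else:
            remove = token.startswith("-")
            name = token.lstrip("-")
            name = ALIASES.get(name, name)
            if name not in FEATURES:
                raise ValueError(f"unknown network feature: {name!r}")
            if remove:
                enabled.discard(name)
            else:
                enabled.add(name)
    return enabled
```

With this sketch, resolve_network("all,-go") and resolve_network("all,-golang") give the same result, which is the kind of alias handling mentioned above.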

A final caveat: it looks like the state of the current flags system would allow us to default a value to a []string, such as all if there is no additional information provided – in other words: syft --network could be made to enable all network features. HOWEVER, this does not let flags be specified without the =, so syft --network all does NOT work, whereas syft --network=all DOES work.

So the questions are:

  • is allowing multiple too complicated?
  • do we care enough about specifying individual network features, or disabling all network activity, to require an = (unlike most of our other flags)?
  • would requiring a second term be okay? e.g. --network all, --network on, or --network enable
  • should the flag not apply to general network operations, but only data enrichment?
  • other thoughts?

I think per-cataloger specification of network stuff is probably too complicated for the CLI, and can be left in the config files where it is today.

One option is to just have two settings, like --network and --no-network (or --network=true|false), which basically mean, “Syft/Grype should make all the network calls they want if it makes the output better,” and “I don’t want Syft/Grype talking over the network at all.”

Another option is basically to have a 3-position flag that’s like --network=none|default|all where all means, “make whatever network requests improve the result”, none means “make no network requests” and default means, “do what the defaults are today.” (Today’s behavior is: Syft and Grype both check whether a newer version is available, and tell the user about it but do not update automatically, and Grype will check whether a new database is available and, if so, download and import it.)

I think the decision between the two options is basically whether the current behavior is important to preserve as a default. If so, the 3-position option is better, but if not, the 2-position option is simpler.

I’m also wary of the following situation: Syft and Grype produce much better results if online data retrieval is enabled, but out of an abundance of caution, it’s disabled by default. Then we’re making the results worse in order to be careful. That situation would remind me of when NVD matching was on by default even for GHSA-covered ecosystems, because despite its very high false positive rate, it technically lowered the false negative rate.

Is anyone in the community really opposed to calling ecosystem-specific resources (npm, Maven Central, Go package repository, etc.) to get information about packages that’s not bundled within the package itself, by default? If so, I’d be curious to hear why. Would your opinion change if we made Anchore-controlled datasets that could be downloaded and then queried offline instead? (Maybe this last paragraph should be its own discussion topic.)

I think if we get to the point of having a 3-level selector and we must use a keyword to enable one of these (using default, none, all as an example), it’s a pretty small jump to include individual capabilities.

The more I experimented with this, the more I’m leaning toward the latter suggestion from my original post. I went ahead and implemented it on the PR, which resulted in an expected bit of friction because the option spans multiple concerns. And maybe that’s a reason to keep it limited to just cataloging things (in which case I don’t think it should have a reference to “network”, since it wouldn’t affect all network things). One thing that didn’t seem to bother me much was requiring a keyword like “on” or “none” to specify the behavior – in other words --network doesn’t work and requires a directive like --network on.

Here are a few examples of how the current PR works. Do any of these feel wrong or confusing?

# enables all network features
syft --network on

# also enables all network features
syft --network all

# also also enables all network features
syft --network enabled 

# enables only java network features
syft --network off,java 

# retains the defaults for all network features but enables java
syft --network java 
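
For readers who want to reason about these examples, the sketch below models the behavior described above: keywords like on, all, and enabled switch everything on, off and none switch everything off, and a bare feature name changes only that feature while leaving the defaults alone. The DEFAULTS table is a hypothetical stand-in (the real defaults are whatever the tools ship with), and apply_network_flag is an illustrative name, not the PR's code.

```python
# Hypothetical defaults, used only to show that a bare feature name
# keeps whatever the defaults are and changes a single entry.
DEFAULTS = {"java": False, "golang": True, "javascript": False}

ENABLE_ALL = {"on", "all", "enabled"}
DISABLE_ALL = {"off", "none"}


def apply_network_flag(value, defaults=DEFAULTS):
    """Return the per-feature state after applying a --network value."""
    state = dict(defaults)  # start from the defaults, never mutate them
    for token in value.split(","):
        token = token.strip().lower()
        name = token.lstrip("-")
        if token in ENABLE_ALL:
            state = {feature: True for feature in state}
        elif token in DISABLE_ALL:
            state = {feature: False for feature in state}
        elif name in state:
            # "java" enables the feature, "-java" disables it
            state[name] = not token.startswith("-")
        else:
            raise ValueError(f"unknown --network directive: {token!r}")
    return state
```

Under these assumed defaults, apply_network_flag("off,java") leaves only java enabled, while apply_network_flag("java") keeps golang on as well because it was on by default.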

I understand what you’re trying to achieve and like the idea of specifying multiple things like all, -golang. But I also think it can get really complicated and I personally would then rather use a config file or the corresponding env vars. What would happen if I specify the corresponding env var and also pass --network all? Which would take precedence?

I think I misunderstood the question at first. If you specify something in a config file, and no env var or flag, it will be used. Currently the way things work in our apps is if you specify something in a config file and an env var, only what’s specified in the env var would be used. And if you specified a flag, regardless of other configured things, only the flag values will be used.
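
As a rough illustration of that ordering for a single configuration key (flag first, then env var, then config file, then the default), here is a small sketch. The function name resolve_value and the dictionary-based sources are assumptions made for the example; the real applications load these sources through their own configuration machinery.

```python
def resolve_value(key, flags, env, config_file, default):
    """Return the value for one key from the highest-priority source.

    Each source is a plain dict that only contains keys the user
    actually set; the first source that has the key wins outright.
    """
    for source in (flags, env, config_file):
        if key in source:
            return source[key]
    return default


# A flag beats an env var, which beats the config file.
flags = {"network": "all"}
env = {"network": "none"}
config_file = {"network": "java"}

print(resolve_value("network", flags, env, config_file, "default"))
print(resolve_value("network", {}, env, config_file, "default"))
print(resolve_value("network", {}, {}, config_file, "default"))
print(resolve_value("network", {}, {}, {}, "default"))
```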

So in this example, --network all would enable all network features, and --network -java would disable only Java. Or --network none would disable all network features, and --network java would enable only Java. This works essentially the same as a boolean flag would: --enable-network enables everything, and SYFT_JAVA_USE_NETWORK=false would override that for the Java network features.

But there is also the “multilevel” configuration – the more specific option, if set, is always used. In all cases using SYFT_JAVA_USE_NETWORK would override due to the “multilevel” aspect: this configuration item takes priority and is a different configuration location, so it doesn’t get overridden by --network (the top-level network options are evaluated, and if the lower per-cataloger options are set, these end up taking precedence, regardless of where they are specified).
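
A small sketch of that “multilevel” rule, under some assumptions: the top-level setting is reduced to a single on/off value, the specific options are read only from an environment-style dict (in practice they could equally come from a config file), and a value of "true" in any letter case is treated as enabled. The function name effective_settings and the feature keys are hypothetical; the three variable names are the ones mentioned in this thread.

```python
# Feature keys are illustrative; the env var names come from the thread.
SPECIFIC_ENV_VARS = {
    "java": "SYFT_JAVA_USE_NETWORK",
    "golang": "SYFT_GOLANG_SEARCH_REMOTE_LICENSES",
    "javascript": "SYFT_JAVASCRIPT_SEARCH_REMOTE_LICENSES",
}


def effective_settings(top_level_on, env):
    """Apply the top-level value, then let any specific option override it."""
    result = {}
    for feature, var in SPECIFIC_ENV_VARS.items():
        raw = env.get(var)
        if raw is None:
            # No specific option set: fall back to the top-level value.
            result[feature] = top_level_on
        else:
            # Assumption: "true" (any case) means enabled, anything else off.
            result[feature] = raw.strip().lower() == "true"
    return result


# Top-level network on, but Java explicitly turned off:
print(effective_settings(True, {"SYFT_JAVA_USE_NETWORK": "false"}))
```

The point of the sketch is only the ordering: the specific option wins whenever it is set, whichever direction the top-level value points.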

The idea here is that most of the time, people would probably either want to use --network on or --network off, but could allow something to be disabled if there happens to be a specific cataloger that may be taking a long time for limited or no benefit (let’s say to look up licenses, but you don’t care about licenses).

In fact, this approach also allows more groups to be specified for individual config options, so perhaps "licenses" could be one of the keys to enable all network but disable remote license lookups only with --network all,-licenses or something.

That said, I agree with the sentiment that we don’t want this to be complicated or confusing for end users. But I’d also rather avoid omitting something that we might have to quickly follow up with because of the performance implications in certain situations. I suspect adding this flag will result in a lot more network usage, and that many users are unaware of the network features hidden in configuration items at the moment.

The general rule in our tools is that, for each configurable value, check command line args, then check env vars, then check config file, then use default. (See the Configuration page of the anchore/syft wiki on GitHub.) But I think you’re asking the more complicated question of “what happens if I pass --network=all but have a particular piece of config that turns off one bit of the network?”

This behavior would really surprise me. If I run syft --network=none ... and it makes network calls to Maven because of something I had in a config file in my home directory, I would consider that a bug.

I think this might be where a 3 position switch on --network works: --network=all means everything can talk on the network and --network=none means nothing can, and --network=default (or failing to pass the arg at all) means respect the multi-level stuff in the config file. How would that sound @kzantow? Or maybe an off always overrides an on?
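
To spell out how this 3-position switch would differ from the multilevel override, here is a minimal sketch. The function name network_allowed, the per-feature config dict, and the choice to treat an unset feature as off are assumptions for illustration only.

```python
def network_allowed(mode, feature, config):
    """Decide whether one feature may use the network.

    "all" and "none" are absolute and ignore per-feature config;
    only "default" consults the more specific settings.
    """
    if mode == "all":
        return True
    if mode == "none":
        return False
    if mode == "default":
        # Assumption: a feature with no specific setting stays off.
        return config.get(feature, False)
    raise ValueError(f"unknown --network mode: {mode!r}")


# A config file that enables Maven lookups cannot defeat --network=none.
config = {"java": True}
print(network_allowed("none", "java", config))
print(network_allowed("default", "java", config))
print(network_allowed("default", "golang", config))
```

The trade-off is that all and none become predictable, at the cost of not being able to combine them with per-feature exceptions.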

It would be surprising to me that in some cases the more specific environment variables are honored and in other cases they are not. But more importantly, I think the primary use case for specifying something beyond “on” or “off” is probably either disabling specific network features due to performance or enabling only specific network features because they are the only thing I’m concerned with. If a user cannot do this with --network all or --network none, I don’t think this would be especially useful and we’d have to follow it up with something that does allow for more fine-grained stuff.

What you’ve described is essentially the same as the original idea of just having a boolean flag that is used, and the only way to enable or disable things is with the more specific config options. Except it prevents enabling or disabling things when specified. If we just relaxed that, so in all cases, the environment variables would do what you expect, but the only way to enable or disable individual functions is with the environment variables, this is effectively what the original implementation was. It certainly keeps things simpler, but inevitably I think users are going to ask for more control when they realize that Syft has online enhancement now that they see the flag, and realize that there can be real performance implications and the only way to tune it is with environment variables.

… but maybe this really shouldn’t be about the network and instead be about the remote enhancement/enrichment, so some network features fall outside the scope of this particular flag and we simply call it --remote-enrichment (or insert other similar name). Based on the code in Syft today, it’s currently affecting: SYFT_JAVASCRIPT_SEARCH_REMOTE_LICENSES, SYFT_GOLANG_SEARCH_REMOTE_LICENSES, and SYFT_JAVA_USE_NETWORK (and currently in the PR the Syft application update check). If we did that, however, it really doesn’t change the questions much: how to enable or disable specific features while bulk-enabling or disabling them?

From the community live stream, here are some conclusions we reached:

  • Syft 1.0 will not use network options by default
  • the UI should probably hint to users when we find packages that could have leveraged online capabilities and suggest using flags
  • want to better characterize performance with network options on
  • we should look at the API docs for rate limiting concerns – are we going to DDoS popular package registries?

From my point of view I think this isn’t a network flag, probably more like --enrich where you can specify all or java,golang or all,-golang.
