DoS Mitigation
What is DoS Mitigation?
DoS mitigation is an essential part of any P2P application. We need to design our protocols to be resilient to malicious peers. We need to monitor our application for signs of suspicious activity or an attack. And we need to be able to respond to an attack.
Here we’ll cover how we can use libp2p to achieve the above goals.
Table of contents
- What is DoS Mitigation?
- Table of contents
- What we mean by a DOS attack
- Incorporating DOS mitigation from the start
- Monitoring your application
- Responding to an attack
- Summary
What we mean by a DOS attack
A DOS attack is any attack that can cause your application to crash, stall, or otherwise fail to respond normally. An attack is considered viable if it takes fewer resources to execute than the damage it does. In other words, if the payoff is higher than the investment it is a viable attack and should be mitigated. Here are a few examples:
A node opening many connections to a remote node and forcing that node to spend 10x the compute time to handle the request relative to the attacker node. This is attack viable because a single node amplifies its affect 10x. This attack will continue to scale if the attacker adds more nodes.
100 nodes asking a single node to do some work, but if this single node goes down it will indirectly cause the loss of an asset. If the asset is more valuable than the compute time of 100 nodes, this attack is viable.
Many nodes connecting to a single node such that that node can no longer accept new connections from an honest peer. This node is now isolated from the honest peers in the network. This is commonly called an eclipse attack and is viable if it’s either cheap to eclipse this node, or if eclipsing this node has a high payoff.
Generally, the effect on our application can range from crashing to stalling to failing to handle new peers to degraded performance. Ideally we want our application to at worst suffer a slight performance penalty, but otherwise stay up and healthy.
In the next section we’ll cover some design strategies you should incorporate into your protocol to make sure your application stays up and healthy.
Incorporating DOS mitigation from the start
The general strategy is to use the minimum amount of resources as possible and make sure that there’s no untrusted amplification mechanism (e.g. an untrusted node can force you to do 10x the work it does). A protocol-level reputation system can help (take a look at GossipSub for inspiration) as well as logging misbehaving nodes and actioning those logs separately (see fail2ban below).
Below are some more specific recommendations
Limit the number of connections your application needs
Each connection has a resource cost associated with it. A connection will usually represent a peer and a set of protocols with each their own resource usage. So limiting connections can have a leveraged effect on your resource usage.
In go-libp2p the number of active connections is managed by the
ConnManager
.
The ConnManager
will trim connections when you hit the high watermark number of
connections, and try to keep the number of connections above the low watermark.
You can protect certain connections with the
.Protect
method. The ConnManager
is in charge of pruning connections to stay below the
defined high watermark, in contrast, the Resource Manager represents a hard
limit where connections will fail to be created in the first place once we’ve
reached our limits. Use the Resource Manager when you need hard limits and the
ConnManager
when you have a range of connections you want to keep. There are
multiple knobs here that do similar things, so take care to set these. We know
this is not ideal and we are tracking this issue
here, contributions welcome.
In rust-libp2p handlers should implement
connection_keep_alive
to define when a connection can be closed. The swarm will close connections when
the root behavior no longer needs it.
You can also set hard limits on the number of connections your application is
allowed to use. In go-libp2p this is done by the Resource
Manager and setting
limits on the system
scope.
In rust-libp2p this is done by using
ConnectionLimits
and passing it to the
SwarmBuilder
.
js-libp2p users should read the section on connection limits in the js-libp2p docs.
Transient Connections
When a connection is first established to libp2p but before that connection has been tied to a specific peer (before security and muxer have been negotiated), it is labeled as “transient” in go-libp2p and “negotiating” in rust-libp2p. Both go-libp2p and rust-libp2p limit the total number of connections that can be in this state since it can be an avenue for DOS attacks. The defaults should work well for most applications, but you may need to change them if your use case involves supporting a lot of connections at once as quickly as possible, or if you want to only handle very few connections at once. We recommend not changing this until you see tangible benefits. And if so, please let us know by filing an issue – we’d be interested in understanding your use case.
In go-libp2p you can tune this by changing the connection limit in the transient scope.
In rust-libp2p you can tune this with ConnectionLimits
as explained above.
Similarly js-libp2p users can adjust the maxIncomingPendingConnections
value in the connection limits as explained in the js-libp2p docs.
Limit the number of concurrent streams per connection your protocol needs
Each stream has some resource cost associated with it. Depending on the transport and multiplexer, this can be bigger or smaller. Design your protocol to avoid having too many concurrent streams open per peer for your protocol. Instead, try to limit the maximum number of concurrent streams to something reasonable (surely you don’t need >512 streams open at once for a peer?). Multiple concurrent streams can be useful for logic or to avoid Head-of-line blocking, but having too many streams will offset these benefits.
Using a stream for a short period of time and then closing it is fine. It’s the number of concurrent streams that you need to be careful of.
The Identify protocol serves as an example of how a protocol can limit the
number of concurrent streams it uses. For go-libp2p look at how pushSemaphore
is
created
and
used.
For rust-libp2p look at how
MAX_NUM_INBOUND_SUBSTREAMS
is used to limit the number of concurrent inbound substreams.
js-libp2p limits the total number of streams a connection can have on a per-multiplexer basis (see mplex for an example) and then applies further limits at the protocol level when registering a protocol handler.
As another example, imagine we are building an RPC-style protocol where responses take minutes. Here are two ways we could implement it:
- Open a stream for each RPC call, and keep that stream open until the RPC call returns.
- Open a stream for the start of the call then close it. The remote side will open a new stream with the response.
Assume we make a lot of concurrent calls. Method 1 would result in a large number of concurrent and mostly inactive streams. Method 2 would result in a fewer number of concurrent streams, and thus lower memory footprint.
If we add a limit in this protocol of say 10 streams, then method 1 will mean we can only have 10 concurrent RPC calls, while method 2 would let us have a much larger number of concurrent RPC calls.
Reduce blast radius
If you can split up your libp2p application into multiple separate processes you can increase the resiliency of your overall system. For example, your node may have to help achieve consensus and respond to user queries. By splitting this up into two processes you now rely on the OS’s guarantee that the user query process won’t take down the consensus process.
Fail2ban
If you can log when a peer is misbehaving or is malicious, you can then hook up those logs to fail2ban and have fail2ban manage your firewall to automatically block misbehaving nodes. go-libp2p includes some built-in support for this use case. More details below.
Leverage the resource manager to limit resource usage (go-libp2p only)
go-libp2p includes a powerful resource manager that keeps track of resources used for each protocol, peer, connection, and more. You can use it within your protocol implementation to make sure you don’t allocate more than some predetermined amount of memory per connection. It’s basically a resource accounting abstraction that you can make use of in your own application.
Rate limiting incoming connections
Depending on your use case, it can help to limit the rate of inbound
connections. You can use go-libp2p’s
ConnectionGater
and InterceptAccept
for this. For a concrete example, take a look at how Prysm
implements their Connection
Gater.
js-libp2p has a similar connection gater that can be configured on node start up and also allows you to drop connections from peers that try to open too many connections too quickly.
Monitoring your application
Once we’ve designed our protocols to be resilient to DOS attacks and deployed them, we then need to monitor our application both to verify our mitigation works and to be alerted if a new attack vector is exploited.
Monitoring is implementation specific, so consult the links below to see how your implementation does it.
For rust-libp2p look at the libp2p-metrics crate.
For go-libp2p resource usage take a look at the OpenCensus metrics exposed by the resource manager here. In general, go-libp2p wants to add more metrics across the stack. This work is being tracked in issue go-libp2p#1356.
js-libp2p collects various system metrics, please see the metrics documentation for more information.
Responding to an attack
When you see that your node is being attacked (e.g. crashing, stalling, high cpu usage), then the next step is responding to the attack.
Who’s misbehaving?
To answer the question of which peer is misbehaving and harming you, go-libp2p exposes a canonical log lines that identifies misbehaving peers. A canonical log line is simply a log line with a special format. For example here’s a peer status log line that tells us a peer established a connection with us, and that this log line was randomly sampled (1 out of 100).
Jul 27 12:14:14 ipfsNode ipfs[46133]: 2022-07-27T12:14:14.674Z INFO canonical-log swarm/swarm_listen.go:128 CANONICAL_PEER_STATUS: peer=12D3KooWSbNLGMYeUuMSXDiHwbhXHzTJaWZzH95MZzeAob9BeB51 addr=/ip4/147.75.74.239/udp/4001/quic sample_rate=100 connection_status="established" dir="inbound"
To see these kinds of logs make sure you’ve enabled the "canonical-log=info"
log level. You can do this in code like
so,
or by setting the environment variable GOLOG_LOG_LEVEL="canonical-log=info"
.
In rust-libp2p you can do something similar yourself by logging a sample of connection events from SwarmEvent.
js-libp2p logs to stdout based on the contents of the DEBUG
environmental variable. This can be set
to log everything with DEBUG=*
, individual components DEBUG=libp2p:connection-manager*,libp2p:upgrader*
.
In the browser you can set localStorage.debug = 'libp2p:connection-manager*,libp2p:upgrader*'
to achieve
the same thing.
If the logs are too verbose you can also exclude components trace logging DEBUG=libp2p:*,-*:trace
and
the value of the variable is consulted at runtime so you can alter the amount or type of logging while
your application is running.
How to block a misbehaving peer
Once you’ve identified the misbehaving peer, you can block them with iptables
or ufw
. Here we’ll outline how to block the peer with ufw
. You can get the
ip address of the peer from the
multiaddr in the log.
sudo ufw deny from 1.2.3.4
How to automate blocking with fail2ban
You can hook up fail2ban to automatically block connections from these misbehaving peers if they emit this log line multiple times in some period of time. For example, a simple fail2ban filter for go-libp2p would look like this:
[Definition]
failregex = ^.*[\t\s]CANONICAL_PEER_STATUS: .* addr=\/ip[46]\/<HOST>[^\s]*
/etc/fail2ban/filter.d/go-libp2p-peer-status.conf
This matches any canonical peer status logs. If a peer shows up often in these sampled logs, something abnormal is happening. i.e. maybe they are churning connections.
A conservative fail2ban rule for go-libp2p using the above filter would look like this:
[go-libp2p-weird-behavior-iptables]
# Block an IP address if it fails a handshake or reconnects more than
# 50 times a second over the course of 3 minutes. Since
# we sample at 1% this means we block if we see more
# than 90 failed handshakes over 3 minutes. (50 logs/s * 1% = 1 log every
# 2 seconds. for 60 * 3 seconds = 90 reqs in 3 minutes.)
enabled = true
filter = go-libp2p-peer-status # This is the filename of the filter above.
action = iptables-allports[name=go-libp2p-fail2ban]
backend = systemd[journalflags=1]
# This uses systemd for logging.
# This assumes you have a systemd service named ipfs-daemon.
journalmatch = _SYSTEMD_UNIT=ipfs-daemon.service
findtime = 180 # 3 minutes
bantime = 600 # 10 minutes
maxretry = 90
/etc/fail2ban/jail.d/go-libp2p-weird-behavior-iptables.conf
Note that the above configuration is relying on systemd to get the logs for ipfs. This will be different depending on your go-libp2p process.
For completeness here’s my systemd service definition for a Kubo instance:
$ cat /etc/systemd/system/ipfs-daemon.service
[Unit]
After=network.target
Description=ipfs-daemon
[Service]
Environment="LOCALE_ARCHIVE=/nix/store/r4jm7wfirgdr84zmsnq5qy7hvv14c7l7-glibc-locales-2.34-210/lib/locale/locale-archive"
Environment="PATH=/nix/store/7jr7pr4c6yb85xpzay5xafs5zlcadkhz-coreutils-9.0/bin:/nix/store/140f6s4nwiawrr3xyxarmcv2mk62m62y-findutils-4.9.0/bin:/nix/store/qd9jxc0q00cr7fp30y6jbbww20gj33lg-gnugrep-3.7/bin:/nix/store/lgvd2fh4cndlv8mnyy49jp1nplpml3xp-gnused-4.8/bin:/nix/store/0f3ncs289m2x1vmv2b3grd6l9x1yp2m3-systemd-250.4/bin:/nix/store/7jr7pr4c6yb85xpzay5xafs5zlcadkhz-coreutils-9.0/sbin:/nix/store/140f6s4nwiawrr3xyxarmcv2mk62m62y-findutils-4.9.0/sbin:/nix/store/qd9jxc0q00cr7fp30y6jbbww20gj33lg-gnugrep-3.7/sbin:/nix/store/lgvd2fh4cndlv8mnyy49jp1nplpml3xp-gnused-4.8/sbin:/nix/store/0f3ncs289m2x1vmv2b3grd6l9x1yp2m3-systemd-250.4/sbin"
Environment="TZDIR=/nix/store/n83qx7m848kg51lcjchwbkmlgdaxfckf-tzdata-2022a/share/zoneinfo"
Environment=GOLOG_LOG_LEVEL="canonical-log=info" LIBP2P_RCMGR=1
ExecStart=/nix/store/mmvd2akskpaszlradl8qv4v703v1cy11-kubo-0.0.1/bin/ipfs daemon
Restart=always
RestartSec=1min
User=ipfs
Example screen recording of fail2ban in action
fail2ban+go-libp2p screen recording
Setting Up fail2ban
We’ll focus on the specifics around fail2ban and go-libp2p here. The steps to take are:
- Install fail2ban. For a general guide to setting up fail2ban, consult this useful tutorial: https://www.digitalocean.com/community/tutorials/how-to-protect-ssh-with-fail2ban-on-ubuntu-20-04.
- Copy the above files into their respective places.
- The filter definition into
/etc/fail2ban/filter.d/go-libp2p-peer-status.conf
- The rule into
/etc/fail2ban/jail.d/go-libp2p-weird-behavior-iptables.conf
.
- The filter definition into
- Remember you may need to tweak the rule to read from the correct log location or change the systemd service name.
- Remember you need to enable the canonical log level (see the above section for how to enable this log level).
- Restart fail2ban to reload the configuration with
systemctl restart fail2ban
. - Verify our jail is active by running
fail2ban-client status go-libp2p-weird-behavior-iptables
. If you see something like:
Status for the jail: go-libp2p-weird-behavior-iptables
|- Filter
| |- Currently failed: 0
| |- Total failed: 0
| `- Journal matches: _SYSTEMD_UNIT=ipfs-daemon.service
`- Actions
|- Currently banned: 0
|- Total banned: 0
`- Banned IP list:
Then you’re good to go! You’ve successfully set up a go-libp2p jail.
Deny specific peers or create an allow list of trusted peers
The go-libp2p resource manager can accept a list of trusted multiaddrs and can use a different set of limits in case the normal system limits are reached. This is useful if you’re currently experiencing an attack since you can set low limits for general use, and higher limits for trusted peers. See the allowlist section for more details.
js-libp2p provides a straightforward allow and deny list mechanism with its connection manager limits.
Summary
Mitigating DOS attacks is hard because an attacker needs only one flaw, while a
protocol developer needs to cover all their bases. Libp2p provides some tools
to design better protocols, but developers should still monitor their
applications to protect against novel attacks. Finally, developers should
leverage existing tools like fail2ban
to automate blocking misbehaving nodes
by logging when peers behave maliciously.