Designing a Bitcoin node crawler
In this article, some basics of the Bitcoin network protocol are examined through a practical lens by looking at a Bitcoin node crawler I developed. The node crawler, serpent, is implemented in Python and is available on GitHub.
Although Bitcoin has been around for a while now, some of its parts remain poorly documented. For experts, “the source code is the documentation.” But for people taking their first steps into Bitcoin, finding the corresponding piece of code and interpreting it correctly can be quite challenging. In many instances, examining a particular aspect in practice can help clear up these uncertainties. For that reason, this article combines the discussion of parts of the peer-protocol specification with a practical application in a node crawler.
Initial node discovery
Most network protocols follow the client-server paradigm, in which multiple clients connect to a central server. Bitcoin, in contrast, uses a so-called peer-to-peer network topology, which means that nodes in the network are directly connected to each other. Typically, a fresh node discovers peers to connect to using hard-coded DNS seed nodes (see, e.g., here for Bitcoin Core). When queried, these seed nodes reply with a list of node IP addresses. The practical application of this initial peer discovery is demonstrated in the bootstrap_nodes function in the node crawler:
import socket

for seed in DNS_SEEDS:
    # Resolve each hard-coded seed hostname into a list of address tuples
    hosts = socket.getaddrinfo(seed, 53, proto=socket.IPPROTO_TCP)
    for host in hosts:
The outer loop iterates over all hard-coded seed nodes. The getaddrinfo function returns a list of hosts for each of the seeds. Each list item returned by getaddrinfo is actually a five-tuple containing different pieces of information (see here for details). For now, it is sufficient to know that the first element of the tuple's last item, the sockaddr (i.e., host[4][0]), corresponds to a host's IP address. Each host's IP address is paired with the default port of the Bitcoin protocol (8333) and added to the pending set. This is the set of nodes that will later be visited by the node crawler.
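Putting these pieces together, a minimal sketch of what bootstrap_nodes might look like is shown below. The seed hostnames and the surrounding function body are illustrative assumptions for this article, not serpent's exact code:

import socket

DNS_SEEDS = ["seed.bitcoin.sipa.be", "dnsseed.bluematt.me"]  # illustrative seeds

def bootstrap_nodes():
    pending = set()
    for seed in DNS_SEEDS:
        try:
            hosts = socket.getaddrinfo(seed, 53, proto=socket.IPPROTO_TCP)
        except socket.gaierror:
            continue  # skip seeds that fail to resolve
        for host in hosts:
            # host[4] is the sockaddr tuple; its first element is the IP address
            pending.add((host[4][0], 8333))
    return pending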
Connecting to a Bitcoin node
Establishing a connection to a Bitcoin node involves a four-way handshake. The handshake begins with the initiating node sending a version message to the desired peer. This version message contains several relevant pieces of information, such as the node's supported protocol version, its current time, IP addresses, the current height of the node's blockchain, and so on (see here for a complete list). The peer then replies with a verack message to acknowledge receipt of the version message. Next, the process is repeated in the other direction: the peer sends the initiating node a version message, and the initiating node replies with a verack message. Once those four messages have been exchanged, the handshake is complete and the connection to the node is established.
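As a rough illustration, the exchange from the initiating node's point of view might look as follows. The VersionMessage and VerackMessage classes and the send and wait_for helpers are assumptions for this sketch, not an actual protocol library:

async def handshake(self):
    # 1. Announce ourselves with a version message
    await self.send(VersionMessage(version=70015, start_height=0))
    # 2. The peer acknowledges it with a verack
    await self.wait_for(VerackMessage)
    # 3. The peer sends its own version message in return...
    peer_version = await self.wait_for(VersionMessage)
    # 4. ...which we acknowledge with a verack of our own
    await self.send(VerackMessage())
    return peer_version  # carries the peer's advertised block height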
Regular node operation
At this point, a regular Bitcoin node examines the version messages from its peers to determine the height of their blockchains. In addition, nodes also exchange getblocks messages, in which each node communicates the hash of the highest block in its blockchain. The node with the longer blockchain will recognize the other node's hash because it is already part of its own chain. The node with the longer chain then sends one or more inv (inventory) messages. Each inv message advertises the hashes of up to 500 missing blocks to the node with the shorter chain. The node receiving the inv messages can then request the missing blocks by sending getdata messages for the hashes it received.
Once the node is up to date, it can start its duty: receiving, validating, and relaying transactions and blocks. None of this, however, is relevant for a node crawler, so the details are not covered here.
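For readers who want a feel for this exchange anyway, here is a rough sketch from the perspective of the node with the shorter chain. The GetBlocksMessage, InvMessage, and GetDataMessage classes are assumptions, and the send and wait_for helpers are the same hypothetical ones used in the handshake sketch above:

import asyncio

async def catch_up(self, tip_hash):
    # Announce our current chain tip so the peer can spot what we are missing
    await self.send(GetBlocksMessage(locator=[tip_hash]))
    while True:
        try:
            # Each inv message advertises the hashes of up to 500 missing blocks
            inv = await self.wait_for(InvMessage, timeout=10)
        except asyncio.TimeoutError:
            break  # no more inventory is coming
        # Request the advertised blocks themselves
        await self.send(GetDataMessage(hashes=inv.hashes))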
Discovering additional nodes
Once a connection has been established, a node can be queried for a history of its peers using the getaddr message. The queried node will reply with one or more addr messages, each of which typically contains around 1,000 entries. Each entry represents a previously seen node and includes a timestamp, indicating when the node was last seen, and the corresponding IP address and port. Once a queried node has sent its history of nodes, it simply stops sending addr messages. The fact that no more data will be sent is thus implicit, so at any given point in time the requesting node can never know whether the queried node will send more addr messages in the future. A simple heuristic for the requesting node to decide whether the queried node is done sending addr messages is to use some sort of timeout. The node crawler implements the following strategy:
import asyncio, time

async def get_peers(self):
    new_nodes = set()
    start = time.time()
    while time.time() - start < 30:  # collect addr messages for at most 30 s
        try:
            message = await self.wait_for(AddrMessage, timeout=10)
        except asyncio.TimeoutError:
            break  # no addr message for 10 s: the peer is done (or gone)
        new_nodes |= set(message.addresses)
    return new_nodes
The crawler begins by sending a getaddr message to a peer. The while loop makes sure the node does not receive addr messages for more than thirty seconds. This turned out to be a good heuristic, for most peers finish sending their peer history in less than thirty seconds. In addition, the maximum allowed interval between consecutive addr messages is set to ten seconds (cf. the timeout=10 parameter in the call to wait_for), so nodes that have finished sending messages or have gone offline are detected in ten rather than thirty seconds.
Finding active nodes
With the mechanisms for initial node discovery, connecting to nodes, and requesting peers' node histories covered, determining the set of active Bitcoin nodes is simple. The following is a (deliberately shortened) version of the function that implements the crawling logic.
async def crawler(self):
    while self.pending:
        node = random.sample(self.pending, 1)[0]  # pick a random pending node
        self.pending.discard(node)
        try:
            connection = await self.connect(node)
        except (ConnectionRefusedError, ..., asyncio.TimeoutError):  # further error types elided
            self.unreachable.add(node)
            continue
        self.active.add(node)
        peers = await connection.get_peers()
        connection.close()
        for peer in peers:
            if peer in (self.active | self.unreachable | self.pending):
                continue  # already seen
            if peer.timestamp < (time.time() - (24*60*60)):
                continue  # not seen by the peer for over a day
            self.pending.add(peer)
Before the crawler function is executed, the bootstrap_nodes function is run to populate self.pending with a set of nodes obtained from the seed nodes. The crawler keeps executing in a loop until no more nodes are left in the set of pending nodes. Each loop iteration starts with picking a random node from the set of pending nodes. The crawler then tries to establish a connection to the selected node using the previously discussed four-way handshake. If a connection can be established, the node is considered active, so it is added to the set of active nodes; if the connection fails, it is instead added to the set of unreachable nodes. In the next step, the node is asked for its peer history using the previously discussed get_peers function. Finally, the connection is closed. Instead of naively adding the peers received from the node to the set of pending nodes, some filtering occurs: nodes that have been seen before (and, therefore, must be in one of the active, unreachable, or pending node sets) are discarded. The same applies to nodes the peer has not seen in more than a day, because there is a good chance such a node has since changed its IP address. All remaining nodes from the peer's history are added to the set of pending nodes. After this is done, the next loop iteration is executed.
It turns out that most of the node crawler's execution time is spent waiting for other nodes' replies over the network. Instead of letting this time spent waiting go to waste, it can be used to speed up the processing of the set of pending nodes by running multiple crawlers in parallel. To this end, the async and await keywords from Python's asynchronous I/O library (asyncio) can be used to make a crawler yield the CPU whenever it performs a blocking operation (such as, e.g., waiting for a peer's reply). This way, whenever a particular crawler yields the CPU, it enables another crawler to fill the waiting time with useful work.
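As a sketch of how this might be wired up, several crawler coroutines can be scheduled over the shared pending set with asyncio.gather. The Crawler class and the worker count of ten are assumptions for illustration:

import asyncio

async def main(num_workers=10):
    c = Crawler()  # hypothetical class holding the pending/active/unreachable sets
    c.pending = bootstrap_nodes()
    # Each crawler coroutine yields the CPU while awaiting network replies,
    # letting the others make progress in the meantime.
    await asyncio.gather(*(c.crawler() for _ in range(num_workers)))
    print(f"discovered {len(c.active)} active nodes")

asyncio.run(main())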
Another way to improve the crawler's execution time is to not ask every node for its peer history. The rationale behind this optimization is that there might be a certain degree of overlap in peers' node histories. By asking only some percentage, p, of nodes for their peer history, the runtime can be reduced significantly because most of it is spent waiting for addr messages. The figure below compares the number of discovered nodes to the crawler's runtime for different values of p, using five runs for each setting.
While very small values of p result in a very low number of discovered active nodes (e.g., p = 2%), the data indicates that p ≥ 10% is sufficient to discover all active nodes. Compared to the naive implementation that asks all nodes for their peer history, this optimization decreases the runtime by a factor of more than three.
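One simple way to implement this sampling is to gate the get_peers call on a random draw. The maybe_get_peers helper and the ASK_PROBABILITY constant below are illustrative assumptions rather than serpent's actual code:

import random

ASK_PROBABILITY = 0.10  # p = 10%, which the measurements above suggest is enough

async def maybe_get_peers(connection):
    # Query only a random fraction p of the successfully connected nodes
    if random.random() < ASK_PROBABILITY:
        return await connection.get_peers()
    return set()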
If you found the information in this article useful, feel free to contribute: 16pGpaoAhzoneLdRdxPSo9xAAPhzWnP2dA. If you have scientific, Bitcoin-related freelance work, let me know.