Designing a Bitcoin node crawler

Motivation

Although Bitcoin has been around for a while now, some of its parts remain poorly documented. For experts, “the source code is the documentation.” But for people taking their first steps into Bitcoin, finding the corresponding piece of code and interpreting it correctly can be quite challenging. In many instances, examining a particular aspect in practice can help clear up these uncertainties. For that reason, this article combines a discussion of parts of the peer-to-peer protocol specification with its practical application in a node crawler.

Initial node discovery

Most network protocols follow the client-server paradigm, in which multiple clients connect to a central server. Bitcoin, in contrast, uses a so-called peer-to-peer network topology, which means that nodes in the network are directly connected to each other. Typically, a fresh node discovers peers to connect to using hard-coded DNS seed nodes (see, e.g., here for Bitcoin Core). When queried, these seed nodes reply with a list of node IP addresses. This initial peer discovery is demonstrated in the bootstrap_nodelist function of the node crawler:

def bootstrap_nodelist(self):
    # Resolve each hard-coded DNS seed and queue the returned IP
    # addresses (with Bitcoin's default port 8333) for crawling.
    for seed in DNS_SEEDS:
        hosts = socket.getaddrinfo(seed, 53, proto=socket.IPPROTO_TCP)
        for host in hosts:
            self.pending.add(Address(host[4][0], 8333))

Connecting to a Bitcoin node

Establishing a connection to a Bitcoin node involves a four-way handshake. The handshake begins with the initiating node sending a version message to the desired peer. This version message contains several relevant pieces of information, such as the node's supported protocol version, its current time, IP addresses, the current height of the node's blockchain, and so on (see here for a complete list). The peer then replies with a verack message to acknowledge receipt of the version message. Next, the process is repeated in the other direction: the peer sends the initiating node its own version message, and the initiating node replies with a verack message. Once those four messages have been exchanged, the handshake is complete and the connection to the node is established.
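
To make the handshake concrete, here is a minimal sketch written in the style of the crawler's other methods. The VersionMessage and VerackMessage classes and the send/wait_for helpers are assumptions for illustration, not necessarily the names used in the actual implementation.

async def handshake(self):
    # 1. Announce ourselves with a version message
    #    (hypothetical VersionMessage class; fields such as the
    #    protocol version and start height are omitted here).
    self.send(VersionMessage())
    # 2. The peer acknowledges it with a verack message ...
    await self.wait_for(VerackMessage, timeout=10)
    # 3. ... sends its own version message in return ...
    peer_version = await self.wait_for(VersionMessage, timeout=10)
    # 4. ... and we acknowledge it, completing the four-way handshake.
    self.send(VerackMessage())
    return peer_version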

Regular node operation

At this point, a regular Bitcoin node examines the version messages from its peers to determine the height of their blockchains. In addition, nodes exchange getblocks messages, in which each node communicates the hash of the highest block in its blockchain. The node with the longer blockchain will recognize the other node's hash because that block is already part of its own chain. The node with the longer chain then sends one or more inv (inventory) messages, each of which advertises the hashes of up to 500 missing blocks to the node with the shorter chain. The node receiving the inv messages can then request the missing blocks by sending getdata messages for the hashes it received.
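
Although the crawler itself does not synchronize blocks, this exchange can be sketched in the same style. GetBlocksMessage, InvMessage, and GetDataMessage are hypothetical class names here, and the best_block_hash and known_blocks attributes are assumptions for illustration only.

async def sync_blocks(self):
    # Tell the peer which block we currently consider our best one.
    self.send(GetBlocksMessage(locator_hashes=[self.best_block_hash]))
    while True:
        try:
            # Each inv message advertises up to 500 block hashes we are missing.
            inv = await self.wait_for(InvMessage, timeout=10)
        except asyncio.TimeoutError:
            break
        missing = [block_hash for block_hash in inv.hashes
                   if block_hash not in self.known_blocks]
        if missing:
            # Request the full blocks behind the advertised hashes.
            self.send(GetDataMessage(hashes=missing))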

Discovering additional nodes

Once a connection has been established, a node can be queried for a history of its peers using the getaddr message. The queried node replies with one or more addr messages, each of which typically contains around 1,000 entries. Each entry represents a previously seen node and includes a timestamp indicating when that node was last seen, along with its IP address and port. Once the queried node has sent its entire history, it simply stops sending addr messages. There is no explicit end-of-data marker, so at any given point in time the requesting node cannot know whether the queried node will send further addr messages in the future. A simple heuristic for deciding whether the queried node is done sending addr messages is to use some sort of timeout. The node crawler implements the following strategy:

async def get_peers(self):
    # Ask the peer for its known addresses ...
    self.send(GetAddrMessage())
    new_nodes = set()
    start = time.time()
    # ... and collect addr replies for at most 30 seconds in total,
    # giving up early if no addr message arrives within 10 seconds.
    while time.time() - start < 30:
        try:
            message = await self.wait_for(AddrMessage, timeout=10)
        except asyncio.TimeoutError:
            break
        new_nodes |= set(message.addresses)
    return new_nodes

Finding active nodes

With the mechanisms for initial node discovery, connecting to nodes, and requesting peers’ node histories covered, determining the set of active Bitcoin nodes is straightforward. The following is a (deliberately shortened) version of the function that implements the crawling logic.

async def crawler(self):
    while True:
        # Stop once there are no more nodes left to try.
        if not self.pending:
            return
        # Pick a random pending node and try to connect to it.
        node = random.choice(list(self.pending))
        self.pending.remove(node)
        try:
            connection = await self.connect(node)
        except (ConnectionRefusedError, ..., asyncio.TimeoutError):
            # The "..." stands for further connection errors elided here.
            self.unreachable.add(node)
            continue
        self.active.add(node)
        # Ask the node for its peer history, then queue every peer that is
        # new to us and was seen within the last 24 hours.
        peers = await connection.get_peers()
        await connection.close()
        for peer in peers:
            if peer in (self.active | self.unreachable | self.pending):
                continue
            if peer.timestamp < (time.time() - (24*60*60)):
                continue
            self.pending.add(peer)

Performance optimization

It turns out that most of the node crawler’s execution time is spent waiting for other nodes’ replies over the network. Instead of letting this waiting time go to waste, it can be used to speed up the processing of the set of pending nodes by running multiple crawlers in parallel. To this end, the async and await keywords from Python's asynchronous I/O library (asyncio) can be used to make a crawler yield the CPU whenever it performs a blocking operation (such as waiting for a peer's reply). This way, whenever a particular crawler yields the CPU, another crawler can fill the waiting time with useful work.
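
As a rough sketch of what this could look like, the snippet below starts a configurable number of crawler coroutines and lets asyncio interleave them. The crawl entry point, the num_workers parameter, and the Crawler class name in the usage comment are assumptions, not part of the crawler shown above.

async def crawl(self, num_workers=100):
    # Each crawler() coroutine yields the CPU while waiting on the network,
    # allowing the other coroutines to make progress in the meantime.
    await asyncio.gather(*(self.crawler() for _ in range(num_workers)))

# Usage (assuming the methods above live on a Crawler class):
# asyncio.run(Crawler().crawl())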
