Design and construction of scalable ad-hoc software overlay networks
Eric Griffis

1. Introduction
The subject of this paper is scale. In particular, we focus on the scalability of names. It has been proposed, or perhaps implied, that names may not scale for at least two reasons: first, a name is a purely syntactic notion that lacks any inherent semantic quality; and second, achieving global consensus on names is prohibitively expensive at large scales. This section outlines the thought process that led me from these concerns to a multi-faceted exploration of ad-hoc distributed software system design. Sections 2 and 3 present the various discoveries I made along the way. Section 4 suggests ideas for future exploration, and section 5 concludes.
1.1 To name is human
Across the spectrum of human activity, names are ubiquitous. We compulsively attach names to things in our environment so we may begin to understand them. As our understanding of a thing grows, the name we give to the thing absorbs new meaning and thus becomes more concise. Similarly, many forms of computer programming are merely abstraction-naming games, where the goal is to express computations with desirable properties like elegance or efficiency. Clearly, the syntactic nature of naming is no barrier to our ability to discover and assign meaning to things in the real world.
The scalability of global consensus is, however, undeniably poor, particularly if we expect consensus to arise in accordance with lofty human ideals like fairness and accountability. When only one agent has authority to assign names, global consensus in small networks emerges almost trivially. As the number of agents with such authority increases, so does the amount of communications overhead required to achieve consensus. Conversely, communications overhead decreases as trust in the controlling agents increases. Hence, large name-based networks tend to degenerate in the face of politics and economic factors.
I found this apparent connection to the soft sciences compelling and wanted to determine how far the analogy would stretch. According to Katouzian [Kat90], philosophy, science, and society are indispensable features of human life. To paraphrase, philosophy provides direction to scientific and social endeavors, society gives purpose to scientific and philosophical pursuits, and science informs the methods of the other two aspects. At the intersections of these aspects, we have economics, politics, and engineering. Economists quantify social activity like resource allocation and exchange; politicians prescribe order for such activity according to moral, ethical, and legal priorities; and engineers design, implement, and analyze the fruits of the decision-making process.

Consider the typical roles of employees at a large software company. The executives identify a problem space and a related client base, then secure funding to hire scientists to search for novel solutions. Once a solution is found, the executives bring in engineers to tailor the results to specific occurrences of the problem. This example covers all six aspects of the triangle. Note that each aspect serves a vital role in the success of the company, including the client base and funding source.
As scientists and engineers, we need to do extra work to understand the other four perspectives, which is unfortunate because this understanding can bring valuable intuition to bear when designing multi-agent systems. This claim is not extraordinary. Case in point: the dining philosophers problem [Hoa85], as told by Hoare, involves a philanthropist (programmer) who hires a College of philosophers (processes) that eat (access shared resources) unpredictably. To prevent deadlock, the philanthropist appoints a footman (semaphore) or a group of lackeys (mutexes). Hoare explains livelock as a footman's irrational dislike of certain philosophers, and provides precisely defined laws (system design) that govern overall behavior.
We can also apply basic knowledge of political structures to get a grasp on our issues with scalable naming. For instance, a system with only one authoritative name-assigning agent is like a dictatorship—changes may be implemented swiftly, but each subject agent must remain in sync with the dictator or risk losing access to basic services. If the dictator is a tyrant, subjects will flee. If demand for basic services overwhelms supply, subjects will starve. To avoid these problems, the dictator must be responsive to local demand and be willing to delegate internally. If the subjects of one dictator wish to interact with subjects of another, both dictators must also be capable of negotiation.
If the analogy is sound, then we should be able to construct arbitrarily scalable naming systems by establishing regions in which distinct dictators enforce local consensus of names and then negotiate inter-region translation conventions. Note that if subjects are capable of operating under several dictators, i.e., holding dual citizenship, then these local regions may overlap. Many interesting parallels exist between distributed systems and political structures; for example, Paxos [Lam01] resembles the American Congress.
1.2 First idea: decentralized name negotiation
My original plan was to explore the conjectured scalable naming system directly. I decided to simulate a billion-agent network of semi-autonomous regions of authority. To stimulate region formation and evolution, I wanted to simulate irrational ideological bias [Fin81] regulated by a quantifiable trust metric [Gol09], likely modeled as probabilities. After some preliminary calculations and a poor reception, I shelved the idea in favor of higher-level pursuits better suited to my interests.
1.3 Second idea: scale-free networks
While reviewing the literature on massive simulations, I discovered that many large, naturally-occurring networks exhibit fractal-like qualities. Popular examples include protein interaction networks and social networks, which are said to be scale free [Bar03]. Scale-free networks can be characterized by three properties: preferential attachment, small network diameter, and resilience to random failure. Preferential attachment is a tendency for older nodes to become better connected than newer ones. For instance, proteins that evolved earlier tend to interact positively with other proteins more frequently than newer ones because of sustained evolutionary pressure. Similarly, well-connected members of a social network get more visibility than new members and so are more likely to make new connections. Finally, hubs emerge naturally in the presence of preferential attachment. Hubs are rare but well-connected, so the diameter of the network remains low and the probability that a hub dies randomly is very small for large networks. Preferential attachment also leads naturally to multiple paths between nodes, which further enhances resilience to random failure.
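To make the mechanism concrete, here is a minimal sketch of preferential attachment in Racket; the graph representation and helper names (pick-by-degree, add-node) are mine, not part of any cited model. A graph is a hash from node id to a list of neighbor ids, and each new node attaches to an existing node chosen with probability proportional to its degree.

#lang racket

;; A node of degree k appears k times across all neighbor lists, so a
;; uniform pick over the flattened lists is a degree-weighted pick.
(define (pick-by-degree graph)
  (define endpoints (append* (hash-values graph)))
  (list-ref endpoints (random (length endpoints))))

;; Attach new-id to a degree-weighted target and record the back edge.
(define (add-node graph new-id)
  (define target (pick-by-degree graph))
  (hash-update (hash-set graph new-id (list target))
               target
               (lambda (ns) (cons new-id ns))))

;; Grow a 50-node network from a single seed edge; hubs emerge naturally.
(for/fold ([g (hash 0 '(1) 1 '(0))])
          ([i (in-range 2 50)])
  (add-node g i))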
Many scale-free network services, such as Twitter, Facebook, and GMail, expose a traditional star network interface. The Twitter retweet network, for example, is a scale-free network [Tin12] simulated within Twitter's internal network and exposed through the Web. Why does a less efficient model prevail over a direct scale-free implementation? What prevents widespread adoption of truly scale-free networks? The Twitter core is essentially a public message routing service, an extremely basic yet non-trivial network, so I attempted to build a decentralized Twitter-like service to answer these questions. I discovered that designing and debugging ad-hoc distributed systems is highly non-trivial and requires specialized tools and methods.
1.4 Complexity is a design problem
Reasoning about concurrent programs is, of course, harder than reasoning about sequential programs, even when the concurrent program is composed of sequentially-programmed agents. My code rapidly became littered with communications boilerplate for managing connections and transmitting messages, which complicated exploration of higher-level issues like message routing and management of inter-agent relationships. I needed an abstraction to manage communications and message delivery so that I could focus on the system design.
The Actor model seemed like a possible fit for my needs, so I tried writing my agents in Erlang. Erlang offers primitive support for fast, asynchronous communication, but its distribution model is not suited for ad-hoc distribution. In Erlang, the behavior of a spawned process is set by the process that initiates the spawn. In other words, all nodes participating in a native Erlang cluster trust each other implicitly. This arrangement is exactly backwards from what I want. Imagine if a Web server merely listened for connections on TCP port 80 and then executed whatever data came in like a native binary. This clearly won't do. I was able to code a proper abstraction to pass tuples over TCP connections, but I had to manage a character buffer and the resulting code looked more like C than Erlang. In the end, I chose Racket, which also offers a complete suite of message management tools based on the standard Scheme read, write, and eval operations. Lesson learned: concurrent message passing is not exactly the same as distributed message passing. The Erlang listing below shows the raw TCP workaround; the Racket version that follows achieves the same exchange with read and write.
%% Module wrapper added for completeness; the module name is mine.
-module(tuple_tcp).
-export([client/1, server/0]).

%% Send one Erlang term to the server, serialized with term_to_binary.
client(Msg) ->
    {ok, Sock} = gen_tcp:connect("localhost", 4545, [binary, {packet, 0}]),
    ok = gen_tcp:send(Sock, term_to_binary(Msg)),
    ok = gen_tcp:close(Sock).

%% Accept one connection, buffer the payload by hand, and decode it.
server() ->
    {ok, LSock} = gen_tcp:listen(4545, [binary, {packet, 0}, {active, false}]),
    {ok, Sock} = gen_tcp:accept(LSock),
    {ok, Bin} = do_recv(Sock, []),
    ok = gen_tcp:close(Sock),
    io:format("~p~n", [binary_to_term(Bin)]).

%% Accumulate received chunks until the peer closes the connection.
do_recv(Sock, Bs) ->
    case gen_tcp:recv(Sock, 0) of
        {ok, B} ->
            do_recv(Sock, [Bs, B]);
        {error, closed} ->
            {ok, list_to_binary(Bs)}
    end.
#lang racket

;; The equivalent exchange in Racket: send the message in write
;; syntax (~s) and recover it on the server with read.
(define (client msg)
  (define-values (in out) (tcp-connect "localhost" 4545))
  (fprintf out "~s~n" msg)
  (flush-output out)
  (close-input-port in)
  (close-output-port out))

;; Accept one connection and print the term received via read.
(define (server)
  (define listener (tcp-listen 4545 5 #t))
  (define-values (in out) (tcp-accept listener))
  (displayln (read in))
  (close-input-port in)
  (close-output-port out))
1.5 Complexity is a construction problem
Though communication overhead is annoying, we can usually mitigate its negative effects on code clarity with whitespace and encapsulation. The bigger problem is determining whether the system will operate correctly. Formal methods exist for verifying the correctness of a concurrent system design, but we cannot verify a design until it has been specified. Hence, we often wish to explore the possibilities as we formulate a design. Dynamic visualization is a powerful tool for gauging the correctness of a design in progress.
Some visualization tools exist, but none proved well-suited to evaluating dynamic system designs. Gephi, for example, is geared toward analyzing properties of very large graph structures, like degree distribution. yEd, another popular tool, is well-suited for smaller structures but provides only static visualizations. To get fast, real-time, dynamic visualization, I ended up writing my own visualization tool, using a combination of JavaScript technologies.
2. DiTMon: dynamic visualization
The visualization tool consists of three components: front, middle and back. The front end is an HTML page that loads the D3.js and Socket.io libraries; it manages a set of nodes and a set of links. The middle component is a Node.js Web server that translates HTTP GET URIs into Socket.io events and dispatches them to the front end. The back end is a Racket library that translates procedure calls into HTTP GET URIs. The Racket library also includes configurable post-event delays to enhance the visibility of select events. All three components are included in the submission archive. To use the visualization library, first embed some visualization calls in the target implementation, then start the Node.js Web server and point a browser at it. As the target executes, its calls into the library are drawn in real time.
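As a rough illustration of the back end's role (this is a sketch, not the actual DiTMon API; the host, port, path scheme, and procedure names are my assumptions), each visualization call reduces to an HTTP GET against the middle tier, followed by a configurable delay so the event stays visible:

#lang racket
(require net/url)

(define vis-host (make-parameter "localhost"))
(define vis-port (make-parameter 8080))
(define post-event-delay (make-parameter 0.25)) ; seconds to linger per event

;; Encode an event and its arguments as URI path segments, fire the
;; GET at the Node.js middle tier, then pause before the next event.
(define (vis-event! kind . args)
  (define path (string-join (map ~a (cons kind args)) "/"))
  (define in (get-pure-port
              (string->url (format "http://~a:~a/~a" (vis-host) (vis-port) path))))
  (port->string in)     ; drain and discard the response
  (close-input-port in)
  (sleep (post-event-delay)))

;; Hypothetical usage inside an agent:
;; (vis-event! 'add-node 42)
;; (vis-event! 'publish 42 17)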
Each node on the visualization surface corresponds to an agent process. A directed edge between two nodes represents an established publish-subscribe relationship between two agents.
A thick colored arrow denotes an inter-agent message exchange. I currently support three kinds of messages. A publish message is green, an unsubscribe message is yellow, and any other message is gray with a dotted tail.
The limit on nodes per experiment maxed out around 150 because of high latency, likely due to browser-based rendering of SVG and the relatively high communications overhead of driving the visualization through HTTP and WebSockets. I tried to replace the HTTP server with direct Unix socket communications, but this quickly degenerated into a distracting game of explicit buffer management that resembled the earlier problem with Erlang. Finally, though JavaScript is asynchronous, it is not truly multi-threaded, which further limits rendering throughput.
3. Experiments
I ended up with nine complete experiments, but did not achieve a truly scale-free design. Each experiment proceeded as follows:
- Start publisher(s).
- Start subscribers.
- Subscribe all subscribers to the publisher(s).
- Publish a message.
- Unsubscribe some subscribers.
- Publish another message.
- Shut down the publisher(s).
Each agent contains a message inbox, a set of publishers, and a set of subscribers.
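A minimal sketch of this agent state in Racket, with names of my own choosing: the built-in thread mailbox serves as the inbox, so an agent is a thread plus two mutable peer sets.

#lang racket

(struct agent (thread publishers subscribers))

;; Spawn an agent whose inbox is its thread mailbox; handle-msg
;; receives each message along with the agent's peer sets.
(define (spawn-agent handle-msg)
  (define pubs (mutable-set))
  (define subs (mutable-set))
  (define t (thread
             (lambda ()
               (let loop ()
                 (handle-msg (thread-receive) pubs subs)
                 (loop)))))
  (agent t pubs subs))

;; Deliver a message to an agent's inbox.
(define (send-to a msg)
  (thread-send (agent-thread a) msg))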
3.1. Case study: star networks
The star network is the classic client-server architecture. When the publisher shuts down, it first unsubscribes any existing subscribers.
3.1.1. Sequential
3.1.2. Concurrent
An interesting discovery was that Racket uses green threads, which causes the apparent layering effect in this version.
3.1.3. Multi-star
This experiment demonstrates a three-publisher network. For visual clarity, no publisher in this example subscribes to another, although a realistic deployment would likely include such dual-role agents.
3.2. Case study: tree networks
To promote tree-like growth, each agent accepts at most three direct subscribers. Further subscription requests are deferred to one of the existing subscribers for relay, which gives the appearance of agents "walking" along the deferral path; a sketch of the rule follows below. When a node unsubscribes, it must notify all direct children so that they can request another publisher relay. To expedite the shutdown process, I send an abort message instead of a full unsubscribe.
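The sketch assumes subscribers are Racket threads and uses message shapes of my own choosing:

#lang racket

(define max-children 3)

;; Accept a direct subscriber while there is room; otherwise relay the
;; request to an existing child, which applies the same rule in turn,
;; producing the "walking" effect along the deferral path.
(define (handle-subscribe requester subs)
  (cond
    [(< (set-count subs) max-children)
     (set-add! subs requester)]
    [else
     (define child (car (set->list subs)))
     (thread-send child (list 'subscribe requester))]))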
3.2.1. Sequential
3.2.2. Concurrent
3.2.3. Multi-tree
3.3. Case study: wheel networks
The wheel network adds redundancy to the star network by making each subscriber a relay for the next. For large networks, this gives the appearance of a spider stitching together its web.
3.3.1. Sequential
3.3.2. Asynchronous
It turns out that maintaining consistent references to the "next" and "previous" subscriber is not straightforward in a concurrent setting, so I ended up guarding subscribe/unsubscribe events in the publisher's thread with a semaphore.
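The guard amounts to serializing ring updates through one semaphore. A minimal sketch, assuming the publisher keeps its subscribers in a list where each entry relays to the next:

#lang racket

(define ring-sema (make-semaphore 1))
(define subscribers '()) ; newest first; each subscriber relays to the next

;; Both update paths run under the same semaphore, so no agent ever
;; observes a half-updated next/previous relationship.
(define (subscribe! who)
  (call-with-semaphore ring-sema
    (lambda () (set! subscribers (cons who subscribers)))))

(define (unsubscribe! who)
  (call-with-semaphore ring-sema
    (lambda () (set! subscribers (remove who subscribers)))))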
4. Future work
Now that I better understand the implications of Racket's green threads, I would like to port the experiments to a platform with a better concurrency model, such as Erlang. This would also be an opportunity to gain performance by working out proper buffer management. I also suspect there are more tools and literature on the subject of dynamic visualization; finding these resources would accelerate the development process.
I would also like to continue pursuing a scale-free model. Now that I have supporting tools in place, the hard part is figuring out how to simulate preferential attachment. A direct approach is to choose the set of publisher relays for each subscriber agent probabilistically, where the probability that a given node becomes a publisher relay is proportional to some function of the number of existing subscriber nodes. Evaluating failures will also require more thought, though a probabilistic approach seems fairly direct.
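One way the proposed selection might look; this is a sketch under my own assumptions, where f maps a node's current subscriber count to a weight (add1 by default, so fresh nodes still stand a chance of being chosen):

#lang racket

;; Pick a relay from nodes with probability proportional to
;; (f (subscriber-count node)).
(define (pick-relay nodes subscriber-count [f add1])
  (define weights
    (for/list ([n (in-list nodes)]) (f (subscriber-count n))))
  (let loop ([ns nodes] [ws weights] [r (* (apply + weights) (random))])
    (if (< r (car ws))
        (car ns)
        (loop (cdr ns) (cdr ws) (- r (car ws))))))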
5. Conclusion
Throughout this work, I've made the assumption that ad-hoc distributed systems are interesting and important. With that in mind, I discovered that designing ad-hoc distributed systems is difficult, and so is constructing them without a formal model. Furthermore, visualization is not trivial—I ended up spending more time on it than I had planned for and never got around to designing a truly scale-free network. I did, however, acquire a deeper understanding of some of the difficulties.
Most importantly, I learned that there is value in general-purpose tools to assist in the design and construction of ad-hoc overlay networks. Though related tools and services exist, trying to fit them to this task can be difficult and distracting.
References
[Bar03] Barabási, A.-L., and E. Bonabeau. “Scale-Free Networks.” Scientific American 288, no. 5 (2003): 60.
[Fin81] Fine, Ben. Economic Theory and Ideology. Holmes & Meier Publishers, 1981.
[Gol09] Golbeck, Jennifer, ed. Computing with Social Trust. Human-Computer Interaction Series. London: Springer London, 2009. http://link.springer.com/10.1007/978-1-84800-356-9.
[Hoa85] Hoare, C.A.R. Communicating Sequential Processes. Prentice Hall International, 1985. http://books.google.com/books?id=GIJQAAAAMAAJ.
[Kat90] Katouzian, Homa. Ideology and Method in Economics. London: Macmillan, 1980.
[Lam01] Lamport, Leslie. “Paxos Made Simple.” ACM SIGACT News 32, no. 4 (2001): 18–25.
[Tin12] Tinati, Ramine, Les Carr, Wendy Hall, and Johnny Bentwood. “Scale Free: Twitter’s Retweet Network Structure,” 2012. http://eprints.soton.ac.uk/344807/.