|
|
An Exploration of Power-Law Networks
Part II: Exploration Details
Owen Densmore
Sun Microsystems Laboratories
owen.densmore@sun.com |
|
Overall Strategy: Two Power-Law Network Types
| Networks observed in peer systems and the web exhibit a power-law
distribution of edges per node. This means that a plot of the number
of edges (connections to the network) per node (host in the peer network)
is approximated by an equation of the form Edges = Constant*Nodes^-gamma.
A sample program building such a power law
produces the plot on the right; click for full size.
gamma = -2.1, bins = 100, K = 1
1:7819 2:1824 3:778 4:425 5:266 6:182 7:131 8:99
9:77 10:62 11:51 12:42 13:36 14:31 15:27 16:23 17:20 18:18 19:16 20:14
21:13 22:12 23:11 24:10 25:9 26:8 27:8 28:7 29:7 30:6 31:6 32:5 33:5 34:5
35:4 36:4 37:4 38:4 39:4 40:3 41:3 42:3 43:3 44:3 45:3 46:3 47:2 48:2 49:2
50:2 51:2 52:2 53:2 54:2 55:2 56:2 57:2 58:2 59:1 60:1 61:1 62:1 63:1 64:1
65:1 66:1 67:1 68:1 69:1 70:1 71:1 72:1 73:1 74:1 75:1 76:1 77:1 78:1 79:1
80:1 81:1 82:1 83:1 84:1 85:1 86:1 87:1 88:1 89:1 90:1 91:1 92:1 93:1 94:1
95:1 96:1 97:1 98:1 99:1 100:0
nodes = 12159, edges = 34014
|
|
This network has 12,159 nodes and 34,014 edges. It is characterized
by a "gamma" or power-law exponent of -2.1. It is dominated by nodes
with just 1 edge/connection: 7,819 of them. It decreases quickly: the 10th
entry shows only 62 nodes with 10 connections, and decreases to 1 node
w/ 59 connections, but continues with a "fat tail", out to 1 node w/ 99
connections.
Network Types: Power List and Generated
The exploration uses two general construction techniques to build these
power-law networks.
-
Power List: A program much like the sample above generates a list
of nodes and edges. Then several strategies are used to build a network
having these numbers. They include:
-
Random (Rd): Random connection of nodes and edges, constrained to
the lists values, halting the list values are met.
-
Top-Down (Td): Top-Down connection of the list nodes to each adjacent
neighbor in the list.
-
Generated: This method uses a generation algorithm to build the
power-law network. The constructive method of Barabási, Albert,
and Jeong is an example. It starts with a small number of empty nodes.
It then adds in new nodes, connected to the existing nodes dynamically
and preferentially .. connecting to higher degree nodes with greater probability.
We use two:
-
BA Graphs (Bag): These use the Barabási, Albert, and Jeong
method of dynamic, preferential attachment of nodes.
-
Small Worlds (SmW): This generates a graph using a variation of
the initial Small Worlds rewiring algorithm. It uses the same rewiring
approach but includes the dynamic-preferential ideas of the Bag network
by rewiring on the fly during the construction of the clustered graph,
preferentially attaching to previous nodes based on the number of edges
per node.
A small set (6) of Java classes were created to build these networks, and
a large number of datasets were created for study; 1890 general sets and
an additional 1000 sets specific to Small Worlds networks to check our
results with existing published data. Each dataset is distinguished
by three or more parameters. Each includes "N"; the number of nodes,
and "K", the number of edges per node. Additional parameters are
unique to each type of network. For example, the third parameter
for Small Worlds nets is the rewiring probability.
Nomenclature
Thus a natural nomenclature evolved for naming the classes of graphs:
-
Random (Rd): The Random Power List graphs are built by pre-allocating
the N nodes with K edges for each node, according to the list, then connecting
edge place-holders at random. Graph Theory shows that not all such
graphs are realizable, especially under the additional constraints we propose
such as "connected" (no island of subgraphs) and "simple" (no loops: edges
looping back to same node, and no parallel edges: two nodes connected by
multiple edges). None the less, minor fudges allow us to derive a
graph with nearly the same parameters, but with either more (+)
edges than the model, or less (-). A key variation is whether
the random selection is among the Nodes (N) or Edges (E)
of the power list. Odd as it may seem, randomly choosing incomplete
nodes or available edges created interestingly different networks.
The named variations, therefore are:
-
RdE+: Random selection of unfilled Edges, opting for overfilling
to complete power list parameters.
-
RdE-: As above, but stopping when filling additional empty parameters
would over-fill others.
-
RdN+: As RdE+, but selecting random nodes, not edges.
-
RdN-: As RdE-, but selecting random nodes, not edges.
-
Top Down (Td): This construction technique creates a sequence of
nodes corresponding to the power list, as above. But rather than
connecting the empty edges at random, the edges are filled "top down" by
starting at the beginning of the list, connecting the top node to all the
nodes below it with vacant edges. This continues until the end of
the list is reached. The list starts sorted, with the first node
being the largest. The list is either left in the Linear form (TdL)
or is shuffled (TdS) before the connecting begins. The connecting
can stop at the last node (-) or "wrap around" (+) until all edges are
completed. This results in these variations:
-
TdL-: Top Down Linear; the list is not shuffled before connecting
empty edges, and the connecting stops at the last node. Thus if a
node has not filled all its empty edges when it reaches the last node,
it will be underfilled, leaving some of its empty edge positions vacant.
-
TdL+: Top Down Linear, wrapping around from the last node to the
first to complete wiring the empty edges. This results in overfilling
the nodes on the top of the list.
-
TdS-: Top Down Shuffled; like TdL- but randomizing the list before
starting connections.
-
TdS+: As above, but completing wiring empty edges by wrapping
around from the last of the list to the first. This results in overfilling
the nodes on the top of the list.
-
Barabási/Albert Graphs (Bag): The graph is constructed by
allocating an initial number (N0) of unconnected nodes. We then allocate
the remaining N nodes by assigning K edges to each new node, which are
connected to the already allocated nodes "preferentially". By preferentially
is meant choosing the node to connected based on the number of nodes already
allocated to that node. Thus more popular nodes tend to become even
more popular. It is the combination of dynamic allocation and preferential
connection that generates the power-law distribution of edges within the
network.
Note: The published description for preferential
connection uses a P(i)=Ki/Sum(Ki) [Eqn (3), page 179]. This creates an
initial condition anomaly during the period the nodes have no edges.
In this case Ki=0, and Sum(Ki)=0 , producing an indeterminate condition.
Three resolutions were considered:
-
Connect the initial nodes in a ring or other equal degree
topology, producing equal probability for all initial nodes.
-
Require N0=K (B/A's m0 and m), thus forming an initial star
topology for the first step. All the examples in Fig 4 & 5 do
this.
-
Introduce a constant bias term C to the probability measure.
This produces P(i)=C+Ki/Sum(C+Ki). This sums over i to 1 and creates
an initial state of equal probability for all nodes. If C=1 this
translates to all nodes contributing a minimum of one to the probability
space. Note that C can be any positive real number, not just integers.
We chose the latter approach, using C=1, even though it introduces
a minor distortion. We did so because we wished to experiment later
on with more dynamic Bag's which lose nodes and edges over time simulating
peer sessions. This third style was the only viable solution for
such a dynamic network. |
 |
-
Small World (SmW): The graph is constructed by generating a regular,
clustered graph with N nodes and K edges per node (K even). By this
is meant each node is connected to K/2 of its nearest neighbors.
In addition, with probability P, edges are randomly wired back to previously
allocated nodes. Thus if P=0, a completely regular, clustered graph
is generated. This varies from the initial Small Worlds graph by
adding the Bag notion of dynamic preferential growth, and results in a
power-law distribution of edges. Two variants are generated:
-
SmW+: Rewired edges are added to the graph, in addition to the original
edge.
-
SmW-: The rewired edge is detached from its initial node.
This gives a total of 11 graph "classes": Bag RdE+ RdE- RdN+ RdN- SmW+
SmW- TdL+ TdL- TdS+ TdS-
The complete name of a dataset includes the third parameter. For
all power list graphs, the third parameter is "gamma", the power-law exponent.
For Bag graphs it is N0, the initially empty set of nodes. And for
SmW graphs, it is P, the rewiring probability. Here are examples
-
RdN-100-3-2.2: This is a Random power list, based on Nodes, with
100 nodes, 3 edges per node minimum, and a gamma of (negative) 2.2.
It is "incomplete" (RdN-) meaning the connecting of nodes stopped when
no further connections would increase existing nodes beyond their limit.
-
TdL+1000-8-3.0: This is a Top Down linear (not initially shuffled)
power list of 1000 nodes with an 8 edge minimum, gamma of 3.0. It
is "complete" (TdL+) in that some nodes may be over filled to allow all
nodes to meet their requirements.
-
SmW-500-10-0.05: This is a Small Worlds graph of 500 nodes and 10
edges per node, with rewiring probability of 5%. Rewired edges are
"detached" (SmW-) from their initial destination node.
-
Bag1000-2-4: This is a B/A graph with 1000 nodes and 2 edges per
node, with initially 4 empty nodes.
Graphs
For each style of network, a set of image pairs were generated for N=25,
50 and 100. The image pairs are 1) the entire graph, along with 2)
the same graph with only "parent", root edges drawn. Some examples
are given here; the "family" will be explained below under "Indices".
Click on the image for full size. Click
on the label for a family of related graphs.
The graph images are color-coded as follows:
-
The red node is the "root" node, that node with the most edges in the network.
-
Green nodes are directly connected to the root, at "Level 1"
-
Yellow nodes are at level 2
-
Red bordered edges are at level 3
-
The edges are blue in the "graphs" images, and red in the "roots" images
The ordering of the graphs around the root, or most connected, node provides
interesting upper bound information about the underlying network.
For example, if the maximum "level" (distance from root) is N, then the
largest distance between any two nodes must be 2*N or less .. it is the
distance between the two nodes traversing through the root.
The "dot/neato" graph drawing program from Bell Labs was used to construct
the images.
Plots
Three plots were created for each dataset:
-
Histogram: A log-log edge histogram showing the power-law characteristics
of the dataset. In the case of power list graphs, two power-law functions
are drawn: the initial gamma and the gamma best fitting the resulting data.
For Bag and SmW generated graphs, just the fit line is shown. (The fit
was provided by Gnuplot. The fit title on the plot includes additional
fit parameters: first X value used, and the standard fit and chi-square
reported by Gnuplot.)
-
Levels: A bar chart of the number of nodes at each "level", the
distance to the root, or most connected, node.
-
Searches: A point plot of 3 sets of 100 sample searches.
-
Power Law: This uses the Adamic, Lukose, Puniyani, Huberman algorithm,
with neighbor radius of 3. The search is done by looking at the node,
its edge nodes, and their edge nodes. If the search does not succeed,
"hop" to the most populous node counting the three search levels (the node,
its edges, and their edges).
-
Depth-First: This uses the above radius of three, but randomly choosing
the node to hop to, rather than using the most populous node. (This
is not exactly a classical depth first search due to its searching intermediate
nodes)
-
Breadth-First: Performs a standard graph traversal, but searching
a neighborhood of radius 3, as above.
The data points are sorted (by power-law first) and plotted by rank order;
thus the X axis is simply ordered from 0-99 for the 100 samples.
Visually, the power-law searches form a curve with the other searches scattered
above and below. Points above are "worse" searches, below are better.
Searches start at "1" rather than "0", thus correspond to "looks" not "hops".
This was done to allow log plotting.
Two sets of these were made for each of the graph types, one for N=1000,
one for N=2000. Examples for TdS-2000-3-2.2 are shown:
Histogram, Level, and Search plots for TdS-2000-3-2.2;
click for full size
hist: TdS-2000-3-2.2 |
level TdS-2000-3-2.2 |
search: TdS-2000-3-2.2 |
Indices
To help navigate the data, several "indices" were made.
First, several datasets are combined into a "family" differing
only in the number of nodes in each dataset. The node values are
25, 50, 100, 250, 500, 1000 and 2000, thus 7 datasets are combined per
family page. There are 270 such families/pages.
| Each page has three full graph images
and the three corresponding root/parent-link graph images as above.
These are followed by two sets of 3 plots, for N=1000 and 2000.
The plots are the histogram, level and search plots discussed above.
The page concludes with a digest of the data for the 7 datasets. The
digest includes:
-
Edges: The number of edges for the graph
-
Level Data: The Max, Avg and Median distance of the nodes from the
root node.
-
Cluster Data: The Cluster Coefficient, Characteristic Path and Diameter
of the graph.
-
Sample Search Data: The values for the three searches (Power Law,
Depth 1st, Breadth 1st) along with the shortest path; averaged over 100
samples.
-
Sample Levels: The Level Max, Avg, Median averaged over 100 samples.
The samples are 100 random points (or N if N<100) chosen as root for
breadth first traversals and used for searches above and gathering levels
data.
|
Click for full size
|
The digest for the sample page is reproduced here:
| Name |
Edges |
L:Mx/Av/Md |
ClC/CPath/D |
S:Pl/D1/B1/S |
P:Mx/Av/Md |
| TdS-25-3-2.2 |
60 |
3/1.96/2 |
0.78/2.76/4 |
1.76/4.32/3.04/2.52 |
3.56/2.5/2.76 |
| TdS-50-3-2.2 |
135 |
9/4.51/5 |
0.75/4.72/10 |
7.54/18.92/13.06/4.28 |
8.26/4.58/4.72 |
| TdS-100-3-2.2 |
291 |
11/4.24/3 |
0.77/5.24/15 |
15.65/41.09/34.42/5.88 |
11.55/5.64/5.24 |
| TdS-250-3-2.2 |
814 |
16/6.51/6 |
0.8/8.31/24 |
43.74/121.03/88.8/8.52 |
18.74/8.9/8.31 |
| TdS-500-3-2.2 |
1745 |
20/7.84/7 |
0.79/10.81/37 |
93.91/245.18/210.57/12.2 |
26.84/11.47/10.81 |
| TdS-1000-3-2.2 |
3697 |
24/10.91/9 |
0.81/14.62/43 |
237.84/612.4/436.28/15.93 |
31.67/15.54/14.62 |
| TdS-2000-3-2.2 |
7823 |
52/20.3/19 |
0.87/24.75/75 |
260.64/945.04/877.96/25.06 |
58.24/26.08/24.75 |
Class Summaries
There are 5 summaries of the data given by each of the 11 classes of graph
(Bag RdE+ RdE- RdN+ RdN- SmW+ SmW- TdL+ TdL- TdS+ TdS-). These are
-
graphs
-
roots
-
edge plot
-
level plot
-
search plot
In addition, each of the three plot summaries are given for both N=1000
and 2000. There are therefore a total of 88 summary pages.
These are a very convenient view onto a large set of data. The pages
can be large, however, so may take time to load. Here are examples
for the SmW- class:
Class Summaries for SmW-; click for full size
Class/Family Index
| Finally, there is a single overall Index page.
It is organized by Class, and within Class, by family. Each Class
has a section with one line pointing to the various Class summaries, and
several lines containing pointers to 5 families each.
Below is the Bag section of the Index page, to the right is a link to
the Index itself.
|
Click for Index Page
|
Data
The 1890 datasets are contained in the data/ directory. Each data
file is named using the nomenclature discussed above, followed by ".dat".
Thus the data for the SmW-50-2-0.2 dataset is stored as data/SmW-50-2-0.2.dat.
In addition, the histogram gnuplot fit log is kept in the hist/log/ directory,
i.e. data/fit/TdL+1000-3-2.2.fit.
All data for graphs and plots are taken from these files and are created
by a variety of Unix shell scripts using gnuplot and neato for graphics.
The source files are listed in Files.html.
Two summary files are created, digest.txt and
summary.txt.
The digest file has the four summary lines from each of the data files,
while the summary file has a one line condensed version of this, and is
the data presented in tabular form on the family pages.
Digest file entry for RdE+25-3-2.2
RdE+25-3-2.2: Random-Edges-Complete N=25 K=3 G=-2.2
Level Totals: Max/Avg/Med=2/1.54/2; Nodes=25, Edges=62
Averages[25]: PL/DFst/BFst/Min=1.32/1.36/1.32/1.88; Path-Max/Avg/Med=3.12/2.06/2.0
Small World Averages[25]: cCoef/charPath/Diam=0.26/2.0/4 |
|
Summary file entry for RdE+25-3-2.2
RdE+25-3-2.2 E=62 L=2/1.54/2 C=0.26/2.0/4 S=1.32/1.36/1.32/1.88 P=3.12/2.06/2.0 |
These summary files are the basis of intra-and inter- family studies,
and studies over all data.
As an example of an intra-family study, the following plots the Power-Law
search values for the Bag series, both before separation by family and
after:
Family Plot: Bag Series Power-Law Search; click for full size
Before separation |
Family Plot |
These "family plots" break a class of network, Bag in this case, into its
various "families", which vary in N, keeping other parameters constant.
Thus in the Power-Law search above, the family plot on the right shows
the 10 varieties of Bag datasets.
An example of inter-family plot is the "class plot" which collects the
11 families together in a single plot. Continuing with the Power-Law
search example, the following plots the Power-Law data for all families,
both for all datasets, and for the 2000 node datasets only:
Class Plot: Power-Law Search; click for full size
All datasets |
2000 node datasets
|
As an example of an overall study, here is the search data for the 270
2000-node graphs. The data contains the search data over 100 samples
of all three searches (power-law, depth first, breadth first) for each
graph, sorted. Thus the total data represented is 270x100x3=81,000
searches as 3 vertical entries over 270 X axis values. The X axis
is simply the rank order within the sorted dataset. The Y axis is
the number of searches required for the sample node to find a random node,
averaged over the samples within the graph dataset (The "S=" entry in the
Summary line above). It emphasizes the superiority of the power-law
style of search.
2000-Node Graph Search Data; click for full size
linear scale |
log-log scale |
An additional 1000 datasets were created for Small World's graphs.
This was done by obtaining 500 random number's between 0 & 0.50 (0-50%)
and using these for rewiring probabilities for 300 node, K=6, SmW+ and
SmW- graphs. Both Digest and
Summary
files, as described above, were obtained for these datasets.
The SmW datasets confirm the relationship between Clustering Coefficient
and Characteristic Path as the rewiring probability increases: The path
decreases faster than the clustering. Similarly, the interplay between
the + & - rewiring strategies (adding edges, detaching edges) is as
expected with added edges decreasing clustering less rapidly than detached,
and decreasing path length more rapidly than detached. The path length
is much less dependent on strategy, however, showing only slight difference
in path length change with probability.
Small Worlds Clustering vs. PathLength; Click for full size
linear scale |
log-X scale |
Discussion
Now that we've seen the details of the study, lets go on to an overall
discussion of the data.
