Impact of Sampling Design

in this post i review the paper [1] and i would like to reference more information related to “Sampling of Graph” .

As we know we need the sampling to estimate the important properties of large networks. As i mentioned in post [2] there are two objective for sampling. therefor based on the goal, the design of sampling will vary. another factor that has impact graph sampling is the query mechanisms.

there are two types of querying mechanisms :

  • backend querying : it is available when on can obtain the value of an entity characteristics from population graph. for example : degree of a sample node refers to its degree in population graph
  • frontend querying : it is available when on can obtain the value of an entity characteristics from the sample graph.  (sampling graph is a representative subgraph).

paper[1] claims that many studies in Internet Topology do not utilize the design considerations. so they conduct a large scale experimental study to demonstrate the relation between various sampling designs and the accuracy of resulting samples in estimating various characteristics of population graphs. the followings are the topological characteristics to study in this paper:

  • degree of a vertexand degree distribution of a graph
  • clustering coefficient : which is a measure of closed connections appearing among vertex groups in a graph and it depends on the number of connected pairs between all neighbors of vertex i.
  • path between nodes i and j :which is a sequence of consecutive edges starting from vertex i and ending at node j.  another important is path length distribution of a graph.

there are several types of graphs, the paper[1] only mentions three types of them which are frequently used.

  • Random Graphs : graphs that are obtained by using Gilbert/Erdos-Renyi model.
  • Graphs with Power-Law Degree Distribution : which are called scale-free graphs and they can be obtained by using Barbasi-Albert model.
  • Small-World Graphs : which are the graphs with a large average clustering coefficient and a small average shortest path. having large clustering coefficient implies the existence of highly clustered cliques in the graph. Also small shortest path refers to the existence of hub vertices or hub cliques within the graph. these graphs are obtained by using Watts-Strogatz model .

in this study these three types of graphs are underlying population graph. in graph sampling applications sampling methods are mainly determined based on the data collection mechanisms. the following sampling methods are used to collect network samples from each type.

  • Random Vertex  : selects a number of vertices with equal probability.
  • Induced Random Vertex : which includes the existing edges from the population graph among the randomly selected vertices .
  • Random Edge : selects a number of edges with equal probability .
  • Induced Random Edge: which includes the existing vertices from the population graph among the randomly selected edges.
  • Random Walk : selects a random vertex as current vertex, then at each step , it selects a new vertex among the immediate neighbors of the current vertex and designates it as the new current vertex.
  • Metropolized Random Walk:  removes the bias introduced by Random Walk. the bias in Random Walk is that is it likely to choose vertices with higher degree, and MRW removes this bias.
  • Random Shortest Path : selects the paths among all possible shortest paths with equal probability. one of the problem of this method is that finding all shortest paths in a graph is costly. the variations of this method are (k,k) sampling which is sampling among k vertices and (k,m) sampling which sampling between k sources and m destinations.


for measuring the difference between the sample and the population distributions the Jensen-Shannon Divergence has been used. the sample size is 10% of each population graph.



[1]Impact of sampling design in estimation of graph characteristics, Emrah CemTozal, M.E. ; Sarac, K. Dept. of Comput. Sci., Univ. of Texas at Dallas, Richardson, TX, USA ;



Internet Topology : AS-Level Graph

In this post I explain the “AS-Level Graph” one of the methods for measuring and visualizing Internet Topology.

As we know Internet is a network of Autonomous Systems. Autonomous Systems are groups of networks sharing the same routing policy, which identified with unique Autonomous system Numbers (ASN).

Autonomous System (AS) Numbers are used by various routing protocols. IANA allocates AS Numbers to Regional Internet Registries (RIRs). The RIRs further allocate or assign AS Numbers to network operators in line with RIR policies. 

Network topology is the arrangement of the various elements such as links, nodes of a computer network. it is topological structure of a network. Internet topology on AS-Level, is the arrangement of ASes and their interconnections. the goal is to analyzing the Internet topology and finding properties of associated graphs rely on mining data and capturing information about Autonomous Systems (ASes). data on the AS-level topology can be inferred using the tables built by the interdomain routing protocol BGP

However, it is still possible to infer such information by exploiting the collateral effects of the Border Gateway Protocol (BGP) and the traceroute tool. In the BGP case, the AS_PATH attribute of UPDATE messages can be used to extract inter-domain connectivity information, although the attribute was originally introduced to prevent routing loops. Similarly, the sequence of IP interfaces obtained via traceroute can be used to infer AS-level connectivity.

C. Faloutsos et al. (see Proc. ACM SIGCOMM, 1999) found that the inter autonomous system (AS) topology exhibits a power-law vertex degree distribution.  in [9] they study and characterize the topology of the Internet at the Autonomous System level and they show that the topology is described with power-laws. before using power-law rule, the metrics that were being used were mainly the node degree and the distance between nodes. some network models were introduced in which the link creation probability depend upon the Euclidean Distance between nodes.(it was good for ARPANET). as internet grow, more detailed models were needed. the first power law observed in network traffic in 1995. also power law describe the topology of peer-to-peer networks, the world wide web and multicast trees.

They monitor and analyze the internet over a period of 5 years and they used RouteView data set and

dia1    power-law2

paper [8] investigates a variety of evolutionary models for AS-Level Internet topology it uses four different methods to generate the initial topology .

  • generating a random graph with specified average degree (using pure random graph where all edges have fixed probability).
  • a snapshot from a real network or other model and then assigning weights to the nodes.
  • incremental with degree preference. it starts from one node and add others one at a time. the edges associated with new node are connected to the existing nodes with probability that is linearly proportional to the degree of existing node.
  • use a pre-assigned degrees for each nodes which are chosen from an exponential distribution.

they also used three different preferences for choosing the edge connectivity; (1) degree of nodes at the time, (2) activity/ number of growth edges when a node is added to graph, (3) age of node.

in paper [10] suggest that for different sampling goal we should use different sampling method. they study two sampling goal for large graphs. these sampling goals are:

1.scale-free sampling goal : in this sampling there is a large static directed graph G with n nodes. also we know the sample size which is n` << n. the goal is to have a sample graph S with similar properties to the original graph G.

2. back-in-time sampling goal : the goal is to mimic the past version of graph G. the sample graph S is a static snap-shot of G. the number of nodes in both graph are the same. the sample graph derives without having any temporal information.

sampling algorithms that are mentioned in the paper are :

1.sampling by random node selection :

  •  uniform-sampling : RandomNode algorithm is used for this purpose in which nodes are selecting uniformly. but the node distribution doesn’t follow the power-law.
  • non-uniform sampling : PageRank algorithm is used in which the probability of a node being selected is set proportional to its PageRank weight.

2. sampling by random edge selection : 

  • uniform sampling (Random Edge sampling) or (Random Node-Edge sampling): in RE we select the edge uniformly randomly. in RNE first a node is picked uniformly randomly, then uniformly at random pick an edge incident to the node.
  • Hybrid sampling : with probability p we perform RNE and with probaility 1-p we perform RE.

3. sampling by exploration : the common idea is to select a node uniformly at random and then explore the nodes in its vicinity.

  • Random Node Neighbor (RNN) : a node with all of its outgoing neighbors is selected uniformly at random.
  • Random Walk (RW) : uniformly at random a starting node is being selected and then simulate walk on the graph. at every step with a probability c , the algorithm flies back to the starting point and restarts the random walk.
  • Random Jump (RJ) : it is similar to the random walk with only difference that at every step with a probability c , the algorithm flies back to any node and restarts the random walk from there.
  • Forest Fire (FF): this algorithm is inspired by the work on temporal graph evolution. the algorithm picks a seed node and begin burning outgoing links and the corresponding nodes. if a link gets burned the node at the other endpoint gets a chance to burn its own links.

this paper derives multiple sampling graphs with different sampling algorithms on several real datasets. the result is showing that for scale-free sampling goal, sampling methods based on random walks performs better. however for back-in-time sampling goal “forest fire” type sampling and sampling based on page-rank score of a node perform best.






[5] Towards Capturing Representative AS-Level Internet Topologies (2002), by Hyunseok Chang , Ramesh Govindan , Sugih Jamin , Scott J. Shenker , Walter Willinger

[6] Traceroute and BGP AS Path Incongruities (2003), Hyun, Y, Broido, A, claffy, k.

[7] John W. Stewart III, BGP4: Inter-Domain Routing in the Internet., (1999)

[8] An evolutionary framework for AS-level Internet topology modeling, (2003) Gao, R.Zegura, E.

[9] Power laws and the AS-level Internet topology, (2003), Siganos, GeorgosFaloutsos, MichalisFaloutsos, P.Faloutsos, C.

[10] Jure Leskovec and Christos Faloutsos. 2006. Sampling from large graphs. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD ’06).

BGP Routing Policies

I am going to explain BGP Routing Policies, because the main dataset for my research is from BGP data/AS-Relations/AS-Paths.

There are three type of relationships between ASes:

  • customer-to-provider/provider-to-customer : The justification for this classification is that an AS must buy transit services for any traffic destined to parts of the Internet that this AS neither owns nor can reach through its customers. ASes at lower levels pay ISPs at higher levels in exchange for access to the rest of the Internet. The customer ISP pays the provider ISP for transit. Links between a customer and a provider are c2p (p2c) links.
  • peer-to-peer : A p2p link connects two ISPs who have agreed to exchange traffic on a quid pro quo basis. Peers exchange traffic only between each other and each other’s customers. This relationship allows peering ISPs to save money on transit costs they would otherwise have to pay to their transit providers for such traffic. In Figure 1, B-C is a p2p link. Note the lack of direction of the link, indicating that neither B nor C is paying each other for the traffic they exchange.
  • sibling-two-sibling : connects two ASes administratively belonging to the same ISP. Such links usually appear as a result of mergers and acquisitions, or under certain network management scenarios

As I mentioned in the previous post, ISPs can own Multiple ASes. Three common relationships between ISPs are:

  • customer-provider: where one ISP pays another to forward its traffic,
  • peer-peer: where two ISPs agree that connecting directly to each other (typically without exchanging payment) would mutually bene- fit both, perhaps because roughly equal amounts of traffic flow between their networks,
  • backup relationships: where two ISPs set up a link between them that is to be used only in the event that the primary routes become unavailable due to failure.

Border Gateway Protocol (BGP) routers of an ISP, typically receive multiple paths to the same destination. The BGP best path algorithm decides which is the best path to install in the IP routing table and to use for traffic forwarding based on the policies. the range of these policies constitutes a huge space. The policies can be divided into four groups:

1. Business Relationship : these policies are used to reflect the relationships between ISPs.

  • Influencing the decision process (by assigning LocalPrefs attribute): ISPs often prefer customer-learned routes over routes learned from peers and providers when both are available
  • Controlling route export (by using the community attribute): Routes learned from providers or peers are usually not exported to other providers or peers, because there is no economic incentive for an ISP to forward traffic it receives from one provider or peer to another. one way to separate the paths learned from peers, is to assign a tag.

2. Traffic Engineering: they reflect the attempts to reduce delay, improve reliability, increasing the number of available routers.

  • Outbound traffic control (by changing LocalPref and IGP costs):
    • One goal is early-exit routing (also called hotpotato routing), where the ISP forwards traffic to its closest possible exit point, so as to reduce the number of links packets traverse and hence the resulting congestion in its internal network.
    • other goal is load balancing by changing LocalPref. The key challenge is to select the proper set of prefixes and change attributes for each appropriately; selecting too large a set will cause too much traffic to shift, overloading one of the links. It can also be tedious to express a long list of prefixes in a router configuration file. Some ISPs deal with this by changing preference for all prefixes whose AS path matches a regular expressions to control how many prefixes match it.
  • Inbound Traffic control (AS Prepending and MED):  there should be mechanism to allow an ISP to control how much traffic it receives from each of its peering links is essential. However, it requires the local ISP to influence route selection in remote ISPs, which in turn might wish to limit or completely ignore the local ISP’s goals. But, an ISP may convince its neighbor by using MED. Also an AS can shift the traffic by prepending multiple copies of its AS number to the AS path in order to artificially inflate the AS-path length.
  • Remote control (community attributes): assume A and B are neighbor ASes. A can allow B to control A’s routing policy with respect to B’s routes by configuring its routers to map certain community attributes to certain LocalPref values. Remote control has some overlapping functionality with other mechanisms to control inbound and outbound traffic. remote control is typically used to allow a customer to tell its provider to perform some action on its behalf.

3. Scalability:  Sending updates too frequently can trigger route instability, leading to poor service quality, or can overload a router’s processing capability or memory capacity, which can cause outages and router failure.

  • Limiting routing table size (by filtering and using the community attribute): by: (a) Filtering long prefixes (e.g., longer than /24) to encourage use of aggregation. (b) As a safety check, routers often maintain a fixed per-session prefix limit that limits the number of prefixes a neighbor can advertise. (c) Default routing for  an ISP with a small number of routes.
  • Limit the number of routing changes (by suppressing routes that flap): it works by maintaining a penalty value associated with the route that is incremented whenever an update is received. When the penalty value surpasses a configurable threshold, the route is suppressed for some time.

4. security-related: . By sending false information, an ISP can subvert a neighbor’s routing goals, cause routers to overload and fail, or degrade service quality.

  • Discarding invalid routes (by import filtering): filtering routes such as:
    • routes to special-use or private addresses, or address blocks that have not yet been allocated are obviously invalid.
    • Advertisements from customers for prefixes they do not own should not be propagated.
    • ISPs can also perform certain sanity checks on the AS path; for example a Tier-1 ISP should not accept any routes from its customers that contain another Tier-1 ISP in the AS path.
    • advertisements containing private AS numbers in the AS path
  • Protect integrity of routing policies (by rewriting attributes): To defend against violations of peering agreements, the ISP can configure the import policy to delete attributes or overwrite them with the expected values
  • Securing the network infrastructure (by export filtering):  For example:
    • filtering the IP addresses used to number the router interfaces.
    • filtering the addresses of the hosts running network-management software.
    • export filtering of invalid routes
  • Blocking denial-of-service attacks (by filtering and damping):some examples are:
    • ISP can configure each BGP session with a maximum acceptable number of prefixes, tearing down the session when the limit is exceeded;
    • the import policy could filter prefixes with large mask lengths (e.g., longer than /24).
    • Upon detecting the excessive BGP updates, the operators could modify the import policy to discard advertisements for the offending prefixes or disable the BGP session.

flowchart for BGP best path Algorithm:


References :

[1] BGP Routing Policies in ISP Networks, (2005) , Caesar, M. Rexford, J., California Univ., Berkeley, CA, USA ;




Internet Topology

Internet Topology is the structure of how hosts, routers or Autonomous systems are connected to each other.  The topology can can be constructed for different layers, but most important layers are :

  • An Internet Protocol (IP) interface-level graph is constructed by extracting IP links directly from the traceroute output: two IP addresses are inferred to form a link if they were observed adjacent to each other in a traceroute output.
  • the router-level Internet topology  is a graph where vertices are routers and edges are one-hop connectivity
  • the autonomous-system level Internet topology is a graph where vertices are autonomous systems and edges represent peering agreements

the Data Sources that can be use for Internet Topology are :

  1. BGP routing tables : they provide information on the AS Internet topology. They are collected from special BGP collectors. These collectors don’t advertise any prefix, they just receive all the routes that internet backbone routers advertise to them. it is passive method.
    1. Reliable
    2. Limited number of vantage points – no complete view of as-level topology
    3. Uses preferred path , so no complete view
    4. Peer-to-peer links can be seen in a BGP routing table of these two peering ASes or their customers-
    5. BGP table dumps miss alternative and back-up paths

      6.BGP updates may provide usperimposition of a number of idfferent snapshots that existed at some point in time

  2. Traceroute : Traceroute is a program to discover the route that IP datagrams follow. it takes the advantage the fact that each router has to decrement the TTL by 1 for each IP packet pass through it and each router has to discard any IP packet with TTL=0 and send an ICMP message to the sender. This data reflects the router-level topology and it is an active method. some well-known tools are Skitter, RocketFuel, NetDimes. in order to use the data for AS-level topology, the IP-to-AS mapping approaches should be used.
    1. reliable
    2. traceroute server explores the routing paths from its location towards the rest of the world
    3. same limitations as BGP data in terms of completeness and link bias
    4. additional challenge is mapping the IP to an AS path
  3. Internet Routing Registry (IRR) : The Internet Routing Registry (IRR) is a distributed routing database development effort. Data from the Internet Routing Registry may be used by anyone worldwide to help debug, configure, and engineer Internet routing and addressing. The IRR provides a mechanism for validating the contents of BGP announcement messages or mapping an origin AS number to a list of networks. ASes use the Routing Policy Specification Language (RPSL) to describe their routing policy. by using the data from this source AS-Level relations can be extracted.
    1. IRR information is manually maintained so they are prune to human errors
    2. however up-to-date IRR entries provide a wealth of information that couldn’t be obtained from any other source .




[3] Internet Topology, Yihua He, Georgos Siganos, Michalis Faloutsos, University of California, Riverside

Census and Survey of the Visible Internet – Part 1

In this post I will discuss the paper “Census and Survey of the Visible Internet” by John Heidemann, Yuri Pradkin, Ramesh Govindan, Christos Papadopoulos, Genevieve Bartlett, and Joseph Bannister, available at

the first-part of paper: general idea and basic definitions,

As we discussed in the previous posts, we want to study the real network such as Internet. this paper wants to find out about the internet structure. it addresses two approaches that are being used “censuses” and  “surveys“. Their contributions are:

  1. to be the first one to use “censuses” approach.
  2. to evaluate the effectiveness of “surveys” that frequently probes a small fraction of the edge of the network.
  3. exploring the host-level internet
  4. estimate the characteristics of the internet such as evaluating typical address occupancy, dynamic address usage, the median active address is continuously occupied, estimating size of stable internet, …
  5. study trends in the deployment of firewalls on the public Internet

what is census and survey methodology?! these both are statistical sampling tools.

census : requires enumerating all members of the population. which means study the whole internet. As we can know it has several challenges :

  • the large number of addresses to be probed
  • how to interpret the results of probing
  • being mistaken as malicious scans
  • effects of lost probes on the results

surveys: it considers a sample. it probes only a subset of addresses. the challenges here are:

  • who is sampled and how often
  • how to be ensure that the sample is large enough to represent internet

the approach of the paper  : they use two approaches to get the best of them. the shared concept among census and survey is the “probing” the addresses.

probing design:

  •  request: for each address, a single probe message is sent and the time has been recorded until the reply is received. the protocol using for probe is ICMP echo-request messages. at first they started to probe by TCP protocol by they got “abuse complaints”.
  •  replies: they categorized the replies in three classes : positive acknowledgment (type 0), negative acknowledgment(type 3, code 9, 10, 11 and 13), no reply
  • request frequency : it means how often this probe is happening , in census just once and in the survey multiple-times over a time-window.
  • Implementation requirement : it should enumerate the internet address space completely which means dispersing probes  to any block across time in a random order. (the private address space and multicast addresses are excluded), they use IANA’s list for routable addresses.  

census design: they probe all allocated addresses. they run the census from two sites, as fast as possible, limited by a fixed number of outstanding probes.

survey design : there are two steps here :

(1) selecting probe frequency of each address : they studied probing intervals as small as 5 minutes and then they chose an interval of 11 minutes. they select a survey size of about 1% of the allocated address space. (I will explain the methods they used to estimate these parameters in the next post)

(2) selecting the sample of addresses to survey : they probe /24 blocks to achieve studying the group of addresses. this is based on the fact that CIDR and BGP routing exploit common prefixes to reduce routing table sizes. and they often share similar patterns of packet loss. they use three different policies to select the blocks:unchanging/random, unchanging/spaced, novel/random

  •        unchanging/randomly policy: half of the blocks are selected when they began surveys in Sept. 2006 and retain them in the future. the half of these blocks were selected from all blocks that had any positive responses. this half is equal to a quarter of all blocks .another quarter of all blocks cover a range of availabilities and volitilities.
  •        unchanging/spaced quarter ensures that unusual blocks are represented in data.
  •        novel/random : the remaining half of all blocks  are selected randomly for each survey , from the set of /24 blocks that are responded in the last census.

(3) how long: they collect the surveys for perdios of about one week

(4) data-sets: 4 general surveys and one ICMP-nmap survey.

Metrics: after reviewing the general idea of new-methodology, we have to look back to the goal of this paper which is characterizing the visible internet. to achieve this goal, two metrics has been introduced in the paper:

(1) availability  of an address . A(addr) . is the fraction of the time a host at an address responds positively. A(block) is the mean of A(addr) for all address in the block.

(2) uptime of an address. U(addr) . is the mean duration for which the address has a continuous positive response, normalized by the duration of probing interval. U(block) is the mean of U(addr) for all address in the block.

the validation of the methodology and the evaluation of the results will be on explained in the next post.

Modeling Networks

The first step to study the properties of networks, is to model the networks. I will review the different types of modeling, the first one is Random Networks/Graphs.

what is random network/graph? it is not a single graph, it is a statistic ensemble where each member has its own given probability of realization(statistical weight). in other words it is a graph with one probability, another graph with other probability. to obtain some quantity, characterizing a random network, we should collect the full statistics for all members of the statistical ensemble.

there are different types of random graph :

  • classic random graph (Poission random graph)
    • Erdos-Renyi Model : we set an edge between each pair of nodes with equal probability, independently of the other edges. These models can be used in the probabilistic method to prove the existence of graphs satisfying various properties, or to provide a rigorous definition of what it means for a property to hold for almost all graphs
    • Gilbert model : take a given number N of labelled nodes, say i=1,2,3,…, N and connect each pair of nodes with a given probability p, we will have “labeled graphs“. the following picture shows labeled graphs for N=2 and N=3. each labeled graph has a probability following the formula, where m is number of edges.
    • labeled-graph
    • c3002785300c02f59ec981ae4713f0f1
  • Generalized random graphs : 
    • the configuration model (power-law degree distribution): directly generalize the Erdos-Reyni graph, by building maximally random network for a given degree distribution. assume we have a set of degree sequence, which is a set of n values of the degree of vertices. we construct a random graph by choosing uniformly matching on the degree “stubs” (half-edges).

I will explain each type in details including the properties of each type. now i have to practice my C++ programming skills 🙂 (C++ Primer, by Stanley B. Lippman, Josee Lajoie and Barbara E. Moo)

References :

[1] Lectures on Complex Networks by Sergey N. Dorogovtsev

[2] Network Analysis and Modeling, CSCI 5352 Lecture 11 Prof. Aaron Clauset


[4] the structure and function of complex networks by M. E. J. Newman


Networks and Graphs,

notice : the assumption is that if you are here and you are reading this post, you know the basics about networks and graphs.

As We know that A Network is a set of items (nodes) with connections between them (edges). In Mathematical literature we call it Graph. [just a small reminder that father of Graph Theory is Euler]. there are different types of Graphs, such as directed, undirected, hypergraphs,  weighted, unweighted, complete, trees, … Also there are different types of real network such as social networks, information networks, biological networks, …

Networks and Graphs, have some interesting properties:

  • small world effect  : the most pairs of nodes in most networks seem to be connected by a short path through network. to see if this is a valid claim we have to find all the shortest paths among the nodes, then calculate the mean! it seems simple, BUT it will cost a lot, (memory, computation).
  • clustering or transitivity : if node A is  directly  connected to node B and node B is directly  connected to node C, so there is a high probability that node A is directly connected to node C. transitivity means the presence of a heightened number of triangles.
  • degree distribution : the degree of a node is the number of edges connected to that node. if we plot the histogram of degrees of nodes, we can see the degree distribution. in real networks the distribution is following power law (scale-free networks).
  • network resilience : it is a measurement of network’s function based on the connectivity of nodes.
  • patterns : such as which nodes pair up with other nodes (core nodes). finding different types of nodes, measuring “assortativity coefficient ” between different types of nodes, etc
  • degree correlation : it is a special case of assotative mixing according to the scalar vertex property. which means according to vertex degree, and it will answer questions such as : do high-degree nodes associate with other high-degree nodes or they are connected mostly to low-degree nodes, ….
  • community structure : it is used mostly in social networks. it refers to groups of nodes that have high density of edges within them. we can find them by cluster analysis.
  • network navigation :  navigating in the real networks is not an easy task and depending on the netwrk type it can be even impossible.

there are more properties that I will mention them later.


[1] the structure and function of complex networks, by M. E. J. Newman