RACIER - A Hierarchical Approach for Content
Internetworking Through Interdomain Routing
Xiaojuan Cai
M.E., Southeast University, China, 1995
B.E., Southeast University, China, 1992
Master of Science
(Department of Computer Science)
we accept this thesis as conforming
to the required standard
The University of British Columbia
September 2002
© Xiaojuan Cai, 2002
In presenting this thesis in partial fulfilment of the requirements
for an advanced degree at the University of British Columbia, I
agree that the Library shall make it freely available for reference
and study. I further agree that permission for extensive copying of
this thesis for scholarly purposes may be granted by the head of my
department or by his or her representatives. It is understood that
copying or publication of this thesis for financial gain shall not
be allowed without my written permission.
The University of British Columbia
Vancouver, Canada
Nowadays, Internet traffic mainly consists of the distribution of static and
media-rich content. While ad hoc mechanisms may extend the reach of content, Content
Internetworking (CI) attempts to foster the interoperability of distinct, independent
Content Networks (CNs). Unfortunately, known CI content routing approaches have
scaling problems and partly duplicate existing network layer functions. In this thesis,
we propose a new Routing Architecture for Content IntERnetworking (RACIER).
By extending existing IP networking infrastructures to incorporate content intelligence,
an Internet built on the new routing architecture can be both IP and content
aware. The key features of the architecture are hierarchy and parallel IP-address
based routing and name based routing, which lend themselves well to scalability and
performance, while maintaining compatibility with traditional Internet architecture.
The Border Gateway Protocol (BGP) is extended to support content
internetworking in our approach. We implemented the system prototype based on
the Multi-threaded Routing Toolkit (MRT) and conducted preliminary experiments to
verify our system architecture.
List of Tables
List of Figures
Thesis Contributions
Thesis Organization
Background and Related Work
Content Networks and Proprietary Solutions
Traditional Content Access Accelerators
Content Networks (CNs)
Sample Proprietary Solutions for CN
Internet Draft for Content Internetworking
Motivation of CN Internetworking
Request Routing (RR) Internetworking
A Glance of BGP
Content Internetworking Architecture — the Big Picture
System Architecture
Subsystem Descriptions
CN Internal Content Request-Routing System
Content Gateway Interconnection Protocol (CGIP)
Content Gateway and Border Gateway Interconnection Protocol (CGBP)
Content Border Gateway Protocol (CBGP)
Route Caching at CIG
Content Request Resolution Process
Design of CBGP
Features of CBGP
Overview of Content Announcement and Withdrawal
Content Update Message Format
Processing of the Content Update
A Preliminary Solution for Content Routing Decision
Content Route Preference Calculation
Content Routing Decision Process
Design with a Refined Metric for Network Conditions
Path Latency Measurement
Routing Decision Improvement without Changing IP Routing
Routing Decision Improvement by Influencing IP Routing
More Design Issues
Controlling the Content Update Frequency
Usage of Network Latency Measurement
Network Proximity Inside an AS
Convergence Process Discussion
Deployment Considerations
Implementation and Experiment
Implementation Based on MRT
Implementation Overview
Introduction of MRT
Content Internetworking Implementation
Topology and Configuration of the Experiment
General Description
Experiment Network Topology
Evaluation Criteria
Data Collection and Analysis
Conclusion and Future Work
Future Work
List of Tables
IP routing table for the sample topology
Content route table for the sample topology
IP routing table with influence of LATENCY
Content route table with influence of LATENCY
The configuration of the testing environment
DNS QoS assessment in the Internet
Content route table in interoperability test
List of Figures
Structure of a CN
Typical user interaction with an Akamaized web site
Request routing internetworking system architecture
System architecture
CBGP routing process
CONTENT attribute format
A typical router architecture
A sample network topology
Phase 1 of content route updating
Phase 2 of content route updating
Network topology for the experiment
Load comparison on two similar configured machines
Response times for a single client
Response times for two concurrent clients
Interoperability testing topology
First of all, I would like to express my gratitude to my supervisor, Dr. Son Vuong,
for his essential guidance, inspiration and suggestions. Without him, this thesis
would have been impossible. I am grateful to Dr. Alan Wagner for being my second
reader and giving me useful comments for improvement.
The financial support from the Computer Science Department at UBC in the
forms of TA and RA as well as the partial support from the CITR Network of Centers
of Excellence for my involvement in the SCE project are gratefully acknowledged.
I would also like to thank all my family members for their consistent support
during my graduate study here.
Thanks also go to my project partner Ming Gan and all other friends at
The University of British Columbia
September 2002
Chapter 1
The past several years have seen the explosive growth of the Internet, especially
the World Wide Web. The primary use of the Internet is content delivery, such as
web pages, images, and increasingly, audio and video streams. Some measurements
indicate that 70 to 80 percent of Internet traffic is HTTP traffic (and most of the
rest of the traffic is for routing and DNS) [11]. That is, almost all of the traffic in
the wide area is delivery of content, and ancillary traffic to locate it and determine
how to deliver it. Today, millions of clients are accessing thousands of web sites
on a daily basis, with the top 20 web sites supplying about 10 percent of the total
content [11]. As the scale and use of the Internet increase, web content providers
find it increasingly difficult to serve all users wishing to access their content with an
appropriately low response time, especially in the face of unexpectedly high loads
(often called "flash crowds"). This is the case despite the massive additions in
bandwidth across the whole Internet [1]. Content Networks (CNs) have become a
popular means of addressing this problem; they typically deploy multiple surrogates
in the Internet, allowing them to offload the management of these surrogates from
individual content providers. This offers economies of scale and greater ability to
respond to high loads. CNs also try to decrease latency for individual clients by
satisfying client requests from nearby surrogates. With these benefits, CNs are of
increasing importance to the overall architecture of the Web.
Despite all the advantages, CNs have their own limitations. It is extremely
capital-intensive and operationally complex to achieve the scale necessary to be
present at the edge of every network or region of the world. Even where they have
content replicas close to a particular network, demand may outstrip capacity from
time to time or failures may occur. Moreover, each CN also has its own physical
scope and content capacity limitation. Clearly, the ability to foster interoperability
of independent CNs through open standards to extend the reach of content can offer
additional flexibility and expand the market for content networking more rapidly.
This falls into the task category of "Content Internetworking" (CI), which aims
at improving scale, fault tolerance and performance for individual CNs. Content
Internetworking is an emerging technology that has promising features to improve
Internet access experience.
IETF working standards of CI mainly consist of the internetworking of three
subsystems: Content Distribution [6], Content Request-Routing [5] and Accounting
[7]. Our research focus is on Request-Routing. The Content Request-Routing
system redirects client requests for content to an appropriate server, selecting from
the origin server and surrogate servers holding that content. The choice is made
according to which one is expected to best serve the client in terms of latency
attributes, such as network proximity, server health and load, content availability,
bandwidth usage and congestion. While CN internal request routing techniques
have been explored and developed extensively in the content networking industry,
Content Request Routing in CI still remains a compelling technology for research
and is the focus of our efforts in this thesis.
Although the IETF CDI working group specializes in CI, and other research
groups also work on similar topics, most of the existing work focuses on addressing
the problem at the application layer. We believe addressing the problem at the
application layer constitutes a duplication of network layer routing infrastructure
and thus violates the principle of clean layering architecture. Some other approaches
[11,12] delegate content routing to the network layer and adopt a pure name based
routing scheme as the only routing mechanism on the Internet. This is not practical
from the deployment viewpoint and is difficult to scale.
This thesis focuses on a new approach to solve the request routing internetworking
problem in the CI, aiming at overcoming most of the defects of the existing
Thesis Contributions
A novel hybrid Routing Architecture for Content IntERnetworking (RACIER) is
proposed in this thesis that combines the strengths of the previous approaches while
overcoming their disadvantages. In our approach, content internetworking is delegated
to the network layer to fully exploit existing routing infrastructures. RACIER
seamlessly supports both IP-address and name-based routing; these two mechanisms
function in parallel on the Internet.
We adopt a hierarchical system architecture to achieve scalability, which
consequently enables a hierarchical request resolution process. Given the fact that a
CN is normally under the management of a single set of policies and operated by a
single Internet service provider (ISP), it is reasonable to assume that most CNs are
confined to the boundary of one Autonomous System (AS). In the few cases where a
CN encompasses multiple ASs, it could have multiple gateways residing on different
ASs, serving the purpose of interconnection. We therefore build an internetworking
architecture within the framework of AS interconnections.
Our architecture accommodates both IP and content routing, and is a superset
of BGP4 [8]. This makes it compatible with the existing IP routing architecture
and infrastructure and easy to deploy. The dual addressing mechanism of
Name+IP we adopt in the routing process utilizes the features of existing network
functionality efficiently and greatly reduces the round trip time for content request
This hybrid routing architecture along with the hierarchical routing and
resolution scheme makes request routing work effectively, which greatly reduces the
size of the content route table for individual CNs.
In our approach we choose a distributed routing scheme, in which origin content
servers are treated the same as replicated servers, over a centralized one.
Thus, we do not have an authoritative request routing system as in [5], and we
avoid the performance and scalability bottleneck it creates.
Furthermore, in our design, only when a client requests content that is not
replicated across the globe, and whose origin server is located in a remote AS, will
the request be redirected to a normal DNS [9, 10] resolution chain. According to
statistics [11], a relatively small number of web sites comprise a large proportion
of the traffic. Discriminating between the handling of replicated content and
non-replicated content further reduces the size of route tables and allows for
content routing. Our approach significantly reduces the response time for accessing
replicated content, especially for the most popular content sites, the routes to which
are likely to be cached on local CIGs. Furthermore, our experiment shows that our
approach only incurs negligible round trip time overhead for accessing non-replicated
content, as compared to normal DNS resolution performance. Our approach thus
lends itself well to performance and scalability, while maintaining compatibility with
the traditional Internet architecture.
To prevent the overloading of a single "best" server, we also explore and
embed a global load-balancing scheme, by supplying multiple good routes to a piece
of content simultaneously, and by using a weighted random selection algorithm to
improve the overall load distribution.
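The idea of weighted random selection over several good routes can be illustrated with a minimal sketch. The route names, the weights and the `pick_route` helper are assumptions made for illustration only; the thesis does not define them at this point, and the actual weighting used in RACIER may differ.

```python
import random

def pick_route(routes, rng=random):
    """Pick one route at random, with probability proportional to its weight.

    `routes` is a list of (name, weight) pairs; the names and weights are
    hypothetical and stand in for whatever preference the router computes.
    """
    names = [name for name, _ in routes]
    weights = [weight for _, weight in routes]
    return rng.choices(names, weights=weights, k=1)[0]

# Three hypothetical surrogates with weights 5 : 3 : 2.
routes = [("surrogate-a", 5.0), ("surrogate-b", 3.0), ("surrogate-c", 2.0)]

rng = random.Random(42)  # fixed seed so the run is repeatable
counts = {name: 0 for name, _ in routes}
for _ in range(10000):
    counts[pick_route(routes, rng)] += 1
```

Over many requests the better routes receive more traffic, but no single server receives all of it, which is the load-spreading effect described above.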
A complete suite of protocols for our content internetworking architecture
has been developed, including the following: Content Border Gateway Protocol
(CBGP), Content Gateway Interconnection Protocol (CGIP), and Content Gateway
Border Protocol (CGBP), encompassing inter-AS, inter-CN and CN-AS communication,
respectively. We have also implemented a system prototype based on a routing
simulation toolkit called Multi-threaded Routing Toolkit (MRT) [14], and conducted
experiments for system evaluation and analysis.
Thesis Organization
The thesis comprises six chapters.
In Chapter 2, the background information is introduced, which includes the
definition and composition of CNs, the draft solution for CI proposed by the
IETF CDI group, and a brief introduction to BGP. Chapter 3 describes our hierarchical
system architecture and the content request resolution process. A detailed
design of the interdomain content routing for content internetworking and of the
routing decision process is provided in Chapter 4. Chapter 5 gives the prototype
implementation and experiment setup with performance evaluation and analysis.
Chapter 6 concludes the thesis and discusses future work.
Chapter 2
Background and Related Work
Content Networks and Proprietary Solutions
We introduce the origin of CNs, the composition of a CN and a few proprietary CN
solutions in this section.
Traditional Content Access Accelerators
The past several years have seen the evolution of technologies centered around
"content." Protocols, appliances, and entire markets have been created exclusively for
the location, download, and usage tracking of content. Some sample technologies in
this area have included web caching proxies, content management tools, intelligent
"web switches", and advanced log analysis tools.
These technologies typically play a role in solving the "content delivery problem".
Abstractly, the goal in solving this problem is to arrange a rendezvous between
a content source at an origin server and a content sink at a viewer's user agent. In
a trivial case, the rendezvous mechanism is that every user agent sends every request
directly to the origin server named in the host part of the URL identifying
the content [1].
As the audience for the content source grows, so do the demands on the origin
server. To achieve better performance, the apparent single logical server may in fact
be implemented as a large "farm" of server machines behind a switch. Both caching
proxies (acting on behalf of the client side) and reverse caching proxies (acting on
behalf of the server side) can be deployed between the client and server, so that
requests can be satisfied by some cache instead of by the origin server. In the
caching approach, an ISP can also deploy regional parent caches to form a hierarchy
that further aggregates user requests and responses.
Both server farms and hierarchical caching are useful techniques, but have
limits. Server farms can improve the scalability of the origin server. However, since
the multiple servers and other elements are typically deployed near the origin server,
they do little to improve performance problems that are due to network congestion.
Caching proxies can improve performance problems due to network congestion (since
they are situated near the clients), but they cache objects based on client demand.
Caching based on client demand performs poorly if the requests for a given object,
while numerous in aggregate, are spread thinly among many different caching proxies
[1]. Also, caches don't provide enough control over what data is actually served by
Thus, a content provider with a popular content source may find that it
has to invest in large server farms, load balancing, and high-bandwidth connections
to keep up with demand. Even with these investments, the user's experience may
still be relatively poor due to congestion in the network as a whole. To address
these limitations, the CDN (Content Delivery Network) or CN (Content Network) [1],
which is a special type of network for content delivery, has been deployed in
increasing numbers in recent years.
Content Networks (CNs)
A CN essentially spreads server-farm-like configurations out to network locations
more typically occupied by caching proxies. A CN has multiple replicas of each
content item being hosted. A request from a browser for a single content item is
directed to a "good" replica, where "good" usually means that the item is served
to the client quickly compared to the time it would take to fetch it from the origin
server, with appropriate integrity and consistency. A CN typically incorporates
dynamic information about network conditions and load on the replicas, so that it
can direct requests to a good replica and balance the load.
A CN has some combination of a content-delivery infrastructure, a request-routing
infrastructure, a distribution infrastructure, and an accounting infrastructure.
The content-delivery infrastructure consists of a set of "surrogate" servers [22]
that deliver copies of content to sets of users. The request-routing infrastructure
consists of mechanisms that move a client toward a rendezvous with a surrogate. Many
request-routing systems route users to the surrogate that is physically "closest" to
the requesting user, or to the "least loaded" surrogate. However, the only requirement
of the request-routing system is that it route users to a surrogate that can serve
the requested content [1]. The distribution infrastructure consists of mechanisms
that move content from the origin server to the surrogates. Finally, the accounting
infrastructure tracks and collects data on request-routing, distribution, and delivery
functions within the CN [1].
Figure 2-1 is a diagram depicting a simple CN as described above:
Figure 2.1: Structure of a CN
Compared to using servers and surrogates in a single data center, a CN is
a relatively complex system encompassing multiple points of presence, in locations
that may be geographically distant.
Operating a CN is not easy for a content
provider, since a content provider needs to focus its resources on developing
high-value content, not on managing network infrastructure. Instead, in a more typical
arrangement, a network service provider builds and operates a CN, including the
request-routers, the surrogates, and the content distributors. This service provider
establishes (business) relationships with content publishers and acts on behalf of
their origin sites to provide a distributed delivery system. The value of that CN to a
content provider is a combination of its scale (the increased aggregate infrastructure
size) and its reach (the increased diversity of content locations).
Some CNs create their network across other providers' physical networks,
and these CN providers must be cautious of the quality of the facilities they use.
Because they don't control the networks or data centers that host their equipment,
they must be diligent in evaluating quality and effectiveness. If they are unable to
negotiate their requirements, they need to move elsewhere, which can be difficult
and costly. So, over time, pure-play CDN providers, such as Akamai, are likely to be
acquired by a larger service provider that wants to enhance its services suite.
Sample Proprietary Solutions for CN
Most CNs constructed today do not rely on overly complex or intricate technologies.
Akamai's service was primarily based on homegrown applications and technologies,
while other providers' services were built using many standard products from
companies such as Cisco, Nortel, and Inktomi. We give a brief introduction to the
solutions provided by Cisco and Akamai in the following.
• Cisco Distributed Director
Cisco's Distributed Director (DD) is its main product to support CN. DD
performs load distribution and content routing in a manner that accounts for relative
client-to-server topological proximities ("distances") to determine the "best"
server [28]. It uses the Director Response Protocol (DRP) to gather the route table
information from Cisco routers that are DRP enabled [28].
DD acts as the primary DNS caching name server for a specific host name
or subdomain. It redirects a name lookup from the main site to a replica site
closer to the requesting client address, making "network intelligent" load distribution
decisions without considering the server side metric [28].
With this product, clients have to incur the response time penalty of accessing
this main site DD before being directed to a closer site.
• Akamai Solution
Akamai uses its proprietary technology called FreeFlow to deliver content.
Figure 2-2 illustrates a typical user interaction with a FreeFlow-enabled website.
First, the user's browser sends a request for a web page to the site. In response,
the web site returns the appropriate HTML code as usual, the only difference being
that the enclosed embedded object URLs have been modified to point to the Akamai
network. As a result, the browser next requests and obtains media-rich embedded
objects from an optimally located Akamai server, instead of from the home site [20].
Figure 2.2: Typical user interaction with an Akamaized web site
The FreeFlow DNS system makes fast delivery of the requested content by
resolving each * server name, which represents the customer content
server, to the IP address of the Akamai server that delivers the requested content to
the user most quickly. FreeFlow DNS is implemented as a 2-level hierarchy of DNS
Web servers: high-level servers (HLDNS) and low-level
servers (LLDNS). Each HLDNS server is responsible for directing each query it
receives to an LLDNS server that is close to the requesting client. The LLDNS
servers perform the final resolution of IP name to server address, directing each
client to the Akamai server best located to serve the client's requests. FreeFlow
DNS continuously monitors network conditions and the status of each server [20].
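The two-level resolution described above can be sketched as two table lookups. This is only an illustration of the hierarchy, not Akamai's implementation: the region keys, server names, addresses (taken from documentation ranges) and latency figures are all hypothetical, and choosing the lowest measured latency is an assumed stand-in for "best located".

```python
# Hypothetical high-level table: client region -> nearby low-level DNS server.
HLDNS = {"na": "lldns-na", "eu": "lldns-eu"}

# Hypothetical low-level tables: content server IP -> measured latency in ms.
LLDNS = {
    "lldns-na": {"192.0.2.10": 35.0, "192.0.2.11": 12.0},
    "lldns-eu": {"198.51.100.7": 20.0},
}

def resolve(client_region):
    """Resolve a request in two steps: HLDNS picks an LLDNS, LLDNS picks a server."""
    lldns = HLDNS[client_region]           # level 1: direct the query to a close LLDNS
    servers = LLDNS[lldns]                 # level 2: candidate servers known to that LLDNS
    return min(servers, key=servers.get)   # assumed metric: lowest measured latency

best_na = resolve("na")
best_eu = resolve("eu")
```

Because the latency table is consulted on every lookup, updating it as monitoring data arrives changes the answer without touching the high-level mapping.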
Internet Draft for Content Internetworking
We introduce the proposed solution for CI from the IETF CDI group in this section.
Motivation of CN Internetworking
There are limits to how large the scale and reach of any one network can be. The
increase in either scale or reach is ultimately limited by the cost of equipment, the
space available for deploying equipment, and/or the demand for that scale/reach of
infrastructure. Sometimes a particular audience is tied to a single service provider or
a small set of providers by constraints of technology, economics, or law. Other times,
a network provider may be able to manage surrogates and a distribution system,
but may have no direct relationship with content providers. Such a provider strives
for a means of affiliating its delivery and distribution infrastructure with other
parties who have content to distribute. The proliferation of content networks and
content networking capabilities brings increasing interest in interconnecting content
networks so as to provide larger scale and/or reach to each participant than they
could otherwise achieve.
The following two subsections introduce the internetworking schema developed
by the current Internet working draft [5]. Instead of introducing all three
subsystems' internetworking, including distribution and accounting, we focus on
introducing the request routing internetworking.
Request Routing (RR) Internetworking
Request routing internetworking is the interconnection of two or more request
routing systems, which increases the number of reachable surrogates. In order for a
publisher's content to be delivered by multiple CNs, each Content Network request
routing system is federated under the Uniform Resource Identifier (URI) name
space of the publisher object. This federation is accomplished by first delegating
the authority of the publisher URI name space to an authoritative request routing
system. This authoritative request routing system subsequently splices each
interconnected (it is called "peering" in the IETF draft) content network request routing
system into this URI name space, and transitively delegates URI name space authority
to them for their participation in request routing. Figure 2-3 is a diagram
showing the request routing (RR) internetworking system architecture. There could
be multiple levels of inter-CN interconnection beyond what is shown in the sample
architecture of Figure 2-3.
The request routing internetworking system is hierarchical in nature. There
exists exactly one request routing tree for each publisher URI. The authoritative
request routing system is the root of the request-routing tree. There may be only one
authoritative request routing system for a URI request routing tree. Subordinate to
the authoritative request routing system are the request routing systems of the first
level peering CNs. There may exist recursive subordinate request routing systems
of additional peering CN levels [4].
The actual "routing" of a client request is through RR CIGs. The authoritative
request routing CIG, which is a globally centralized node, receives the client requests
and forwards them to an appropriate peering CN. This process of inter-CN
request-routing may occur multiple times in a recursive manner between request routing
CIGs until the request routing system arrives
at an appropriate CN to deliver the content.
Figure 2.3: Request routing internetworking system architecture
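The recursive forwarding through the request routing tree can be sketched as a walk from the authoritative root down to a CN that can serve the client. All node names, the coverage table and the depth-first search order below are assumptions for illustration; the IETF draft does not prescribe this selection logic.

```python
# Hypothetical request-routing tree: parent RR system -> subordinate RR systems.
TREE = {
    "authoritative-rr": ["cn-a-rr", "cn-b-rr"],
    "cn-a-rr": [],
    "cn-b-rr": ["cn-c-rr"],   # second-level peering below CN B
    "cn-c-rr": [],
}

# Hypothetical coverage: which client regions each CN can serve from a surrogate.
COVERAGE = {"cn-a-rr": {"na"}, "cn-b-rr": set(), "cn-c-rr": {"asia"}}

def route(node, region, hops=None):
    """Return the list of RR systems visited until one can serve `region`, else None."""
    hops = (hops or []) + [node]
    if region in COVERAGE.get(node, set()):
        return hops
    for child in TREE[node]:               # forward the request to each peering CN in turn
        result = route(child, region, hops)
        if result is not None:
            return result
    return None

path_asia = route("authoritative-rr", "asia")
path_na = route("authoritative-rr", "na")
```

Every entry in the returned list is one redirection step, so a request served by a second-level CN passes through three request routing systems, always starting at the single root.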
In this schema, there may exist multiple levels of off-path (path means the
physical path to the content) redirections between CNs before a name lookup is
finally resolved, and this adds a fairly high cost to the content access process. This
is further exacerbated by the necessity of going to the globally centralized authoritative
request routing system first, which is a performance bottleneck. In addition,
advertisements about aspects of topology, geography and performance of a single
CN to other CNs, required in [5], do not help much in making the global request
routing decision due to the lack of knowledge of the global topology. The fact that
content networks are overlay networks inherently makes decision processes very
complex [5].
A Glance of BGP
The Border Gateway Protocol (BGP) is an inter-Autonomous System (AS) routing
protocol. The primary function of a BGP speaking system is to exchange network
reachability information with other BGP systems. This network reachability
information includes information on the list of ASs that the reachability information
traverses. For this reason, BGP is called a path vector protocol. The network
reachability information is sufficient to construct a graph of AS connectivity, from which
routing loops may be pruned and some policy decisions at the AS level could be
enforced [8].
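A minimal sketch of how the AS list carried with a route lets a BGP speaker prune loops: a route is rejected if the local AS number already appears in its path, and the local AS is prepended before the route is advertised onward. The AS numbers and the helper names are illustrative assumptions.

```python
def accept_route(local_as, as_path):
    """Reject any route whose AS path already contains the local AS (a loop)."""
    return local_as not in as_path

def advertise(local_as, as_path):
    """Prepend the local AS number before passing the route to an external peer."""
    return [local_as] + as_path

# AS 65001 originates a route; it travels 65001 -> 65002 -> 65003.
path_at_65002 = advertise(65001, [])
path_at_65003 = advertise(65002, path_at_65002)

# If AS 65003 advertised the route back to AS 65001, AS 65001 would see itself in the path.
path_back_at_65001 = advertise(65003, path_at_65003)
loop_detected = not accept_route(65001, path_back_at_65001)
```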
BGP4 is the newest version of BGP and it provides a new set of mechanisms
for supporting classless interdomain routing by advertising an IP prefix with an
arbitrary length of up to 32 bits. BGP4 also introduces mechanisms which allow the
aggregation of routes, including aggregation of AS paths.
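Classless prefixes and route aggregation can be tried out with Python's standard `ipaddress` module. The addresses below come from documentation ranges and are chosen only for illustration; this shows the address arithmetic, not BGP's own aggregation procedure.

```python
import ipaddress

# A classless prefix: 24 leading bits identify the network.
prefix = ipaddress.ip_network("192.0.2.0/24")
covered = ipaddress.ip_address("192.0.2.55") in prefix      # inside the prefix
not_covered = ipaddress.ip_address("192.0.3.1") in prefix   # outside the prefix

# Two adjacent /25 routes can be advertised as a single /24 aggregate.
routes = [
    ipaddress.ip_network("198.51.100.0/25"),
    ipaddress.ip_network("198.51.100.128/25"),
]
aggregated = list(ipaddress.collapse_addresses(routes))
# aggregated == [IPv4Network('198.51.100.0/24')]
```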
BGP uses TCP as its transport protocol (port 179). This ensures that all
transport reliability, such as retransmission, is taken care of by TCP and does not
need to be implemented in BGP itself.
Two BGP routers form a transport protocol connection with each other.
These routers are called neighbors or peers. Peer routers exchange multiple messages
to open and confirm connection parameters, such as BGP version. In a case of
disagreement between peers, notification errors are sent, and the peer connection
does not get established. After the connection is established, all candidate BGP
routes are exchanged initially, and later, incremental updates are sent as network
information changes.
Routes are advertised between a pair of BGP routers in UPDATE messages.
The UPDATE message contains, among other things, a list of (length, prefix) tuples
that indicate the list of destinations reachable via each system. The UPDATE
message also contains the path attributes, which include information such as the
degree of preference for a particular route [22]. Path attributes fall into four separate
categories [8]:
1. Well-known mandatory.
2. Well-known discretionary.
3. Optional transitive.
4. Optional non-transitive.
Well-known attributes must be recognized by all BGP implementations.
Some of these attributes are mandatory and must be included in every UPDATE
message. Others are discretionary and may or may not be sent in a particular
UPDATE message.
All well-known attributes must be passed along (after proper updating, if
necessary) to other BGP peers.
In addition to well-known attributes, each path may contain one or more
optional attributes. It is not required or expected that all BGP implementations
support all optional attributes. The handling of an unrecognized optional attribute
is determined by the setting of the Transitive bit in the attribute flags octet. Paths
with unrecognized transitive optional attributes should be accepted. If a path with
an unrecognized transitive optional attribute is accepted and passed along to other
BGP peers, then the unrecognized transitive optional attribute of that path must
be passed along with the path to other BGP peers [8].
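The role of the Transitive bit can be made concrete with a small sketch of the attribute flags octet. The bit positions (Optional 0x80, Transitive 0x40, Partial 0x20) follow the BGP-4 specification; the `handle_unrecognized` function, its return labels, and the setting of the Partial bit on forwarded attributes are details drawn from that specification rather than from the text above, and the function itself is hypothetical.

```python
# Bit masks of the attribute flags octet as defined by BGP-4.
OPTIONAL = 0x80
TRANSITIVE = 0x40
PARTIAL = 0x20

def handle_unrecognized(flags):
    """Decide what to do with an attribute this speaker does not recognize.

    Returns an (action, flags) pair; the action labels are made up for this sketch.
    """
    if not flags & OPTIONAL:
        # Well-known attributes must be recognized, so this is an error.
        return ("error", flags)
    if flags & TRANSITIVE:
        # Keep the attribute, mark it Partial, and pass it along with the path.
        return ("forward", flags | PARTIAL)
    # Optional non-transitive: ignore it and do not pass it along.
    return ("ignore", flags)

transitive_case = handle_unrecognized(OPTIONAL | TRANSITIVE)
non_transitive_case = handle_unrecognized(OPTIONAL)
well_known_case = handle_unrecognized(TRANSITIVE)
```

An optional transitive attribute is the kind of attribute a protocol extension can use to cross routers that do not understand it, since those routers forward it unchanged apart from the Partial bit.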
Important well-known attributes include the following:
• AS_PATH: This attribute identifies the autonomous systems through which
routing information carried in this UPDATE message has passed. The components
of this list can be AS_SETs or AS_SEQUENCEs. An AS_SET is an
unordered set of ASs a route in the UPDATE message has traversed, while
an AS_SEQUENCE is an ordered set of ASs a route in the UPDATE message
has traversed.
• NEXT_HOP: This path attribute defines the IP address of the border router
that should be used as the next hop to the destinations listed in the UPDATE
message.
• LOCAL_PREF: This attribute is a degree of preference given to a route, determined
by the local policy settings, which is used by the routing decision process to
compare the route with other routes to the same destination. The higher the number,
the more favorable the route is.
• MULTI_EXIT_DISC: This attribute may be used on external (inter-AS) links
to discriminate among multiple exit or entry points to the same neighboring
AS.
In case of information changes, such as a route becoming unreachable or
a better path emerging, BGP informs its neighbors by withdrawing invalid routes
and injecting new routing information. Withdrawn routes are part of the UPDATE
message, with a specific value of the message type field.
If no routing changes occur, the routers exchange only KEEPALIVE packets.
KEEPALIVE messages are sent periodically between BGP neighbors to ensure that
the connection is kept alive.
A typical path selection process for a BGP router resembles the following
(derived from the strategy used by Cisco routers [24]):
1. If the path specifies a next hop that is inaccessible, drop the update message.
2. If the weights are the same, prefer the path with the largest local preference.
3. If no route was originated, prefer the route that has the shortest AS path.
4. If all paths have the same AS-path length, prefer the path with the lowest
origin type (where Interior Gateway Protocol (IGP) is lower than Exterior
Gateway Protocol (EGP)).
5. If the origin codes are the same, prefer the path with the lowest MULTI_EXIT_DISC
attribute.
6. Prefer the path with the lowest IP address, as specified by the BGP router
ID.
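The selection steps above can be read as a lexicographic comparison. The following is a minimal sketch of that reading, not the vendor's implementation. The `Route` fields, `ORIGIN_RANK`, and `select_best` are hypothetical names introduced for illustration. The sketch also assumes that a larger weight is preferred and that a locally originated route is preferred before AS-path length is compared; the steps imply both comparisons ("If the weights are the same", "If no route was originated") without spelling them out.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address
from typing import List, Optional

# Lower rank is preferred: IGP before EGP before INCOMPLETE.
ORIGIN_RANK = {"IGP": 0, "EGP": 1, "INCOMPLETE": 2}


@dataclass
class Route:
    """Hypothetical container for the attributes used by the steps above."""
    next_hop_reachable: bool
    weight: int
    local_pref: int
    locally_originated: bool
    as_path: List[int]
    origin: str
    med: int
    router_id: str


def preference_key(route: Route):
    """Smaller tuples are better; each element mirrors one comparison step."""
    return (
        -route.weight,                  # assumed: larger weight first
        -route.local_pref,              # step 2: largest local preference
        not route.locally_originated,   # assumed: locally originated first
        len(route.as_path),             # step 3: shortest AS path
        ORIGIN_RANK[route.origin],      # step 4: lowest origin type
        route.med,                      # step 5: lowest MULTI_EXIT_DISC
        int(IPv4Address(route.router_id)),  # step 6: lowest router ID
    )


def select_best(routes: List[Route]) -> Optional[Route]:
    """Step 1 drops routes with an inaccessible next hop, then the key decides."""
    usable = [r for r in routes if r.next_hop_reachable]
    return min(usable, key=preference_key) if usable else None
```

Encoding the steps as one tuple keeps the tie-breaking order explicit: a later element is consulted only when all earlier ones are equal.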
Chapter 3
Content Internetworking
Architecture — the Big Picture
System Architecture
As we mentioned before, the IETF draft approach requires the world-wide clients
of a site to incur the long round-trip time to go to the authoritative request routing
system first, which in turn redirects the request to other CNs, an unknown number
of times, until the content request is finally resolved. These long round-trip times
are purely overhead, potentially far higher than the round-trip to the origin content
server itself. This issue is fast becoming the dominant performance problem
for clients as Internet data rates move to multiple gigabits, sometimes reducing the
transfer time for content to insignificant levels. The name resolution process may
also use congested portions of the network that the content delivery system is
otherwise designed to avoid, and the same congested portion can be repeatedly used in
the process since the peering CNs have no information about network conditions.
Besides having high latency, the authoritative request-routing (RR) system is also a
bottleneck for scalability. Further, selecting a good content surrogate usually requires
measuring the proximity of a particular piece of content to existing or potential clients in
order to make routing decisions. Content provider networks must either obtain routing
information from routers near their servers, or else make direct network measurements,
which imposes huge traffic increases on the whole network. Meanwhile, aggregating
network information for scalability is still required, which duplicates the existing
routing functions of the network.
Considering that the dominant traffic in the Internet is content access, we
need to provide an infrastructure that serves this purpose effectively and efficiently.
We keep the following points in mind when creating our design:
• In order to avoid a performance bottleneck for the whole system, we cannot
adopt a centralized scheme; we need to design it in a fully distributed way.
• We should make full use of the network-level proximity information already
available in IP routing, because any content access follows the physical path
decided by the network layer routing strategy anyway, and we should not
duplicate existing network functionality and waste Internet bandwidth.
• Our approach should be compatible with the existing Internet infrastructure
for easy and step-by-step deployment.
• Our approach should be able not only to accelerate access to replicated
content by selecting a good content server to serve the client's request,
but also to avoid overloading a single selected server, thus maintaining a
global load balance.
Taking into consideration the nature of content routing and a clean layering
architecture, and aware of the large-scale routing achieved by the interdomain
routing infrastructure, we propose an approach that delegates content internetworking
from the application layer to the network layer. In our approach, the existing interdomain
routing infrastructure is extended to accommodate content routing, which is
name-based. That is, in our system, IP routing and content routing share routing-related
information and function in parallel.
Our proposed content internetworking architecture integrates content routing
with traditional IP routing in a hierarchical way, as shown in Figure 3.1.
Figure 3.1: System architecture
CS - Content Server or Surrogate
For the interdomain connection, we inherit the mechanism of the current
Internet, which uses BGP. Our content internetworking infrastructure integrates with
BGP. The Border Gateway in our approach supports both IP and content routing.
To distinguish it from the original Border Gateway, which is only capable of IP
routing, ours is called the Content Border Gateway (CBG).
The system architecture consists of three levels, named from lowest to highest: the CN internal system level, the CN level, and the AS level.
As we mentioned before, in a typical CN, a single service provider operates
the request-routers, the surrogates, and the content distributors. Therefore, these
CNs usually have a large presence within a particular network, confined within the
boundary of an administrative system. In a case where some overlay
CN encompasses multiple ASs, we can have one CIG of that CN in each AS to
perform the task of content internetworking. For the rest of this thesis, we assume
every CN is contained in a single AS.
In a single AS, there may be multiple CNs. Of these, only Content Interconnection
Gateways (CIGs) are visible, and CNs communicate via CIGs with the
external world. Each CIG has knowledge of the contents residing on its local CN.
Each CN has one or more CIGs. CIGs within one AS communicate using the Content
Gateway Interconnection Protocol (CGIP) to maintain a consistent view of the
content routing situation in a local AS. The CIG (the representative of all CIGs in
the local AS) also communicates with a local CBG, to propagate knowledge of the
replicated content residing on the local AS. CIGs learn routing information about
content residing on foreign ASs directly from the CBG when necessary.
CBGs interconnect ASs through the Content Border Gateway Protocol (CBGP),
replacing the original BGP router in the traditional Internet routing architecture.
CBGs seamlessly accommodate IP-address and name-based routing, and are
backward compatible with traditional BGP routers. A CBG acquires routing information
for all the replicated content residing on its local AS, through communication
with the CIG. The CBG keeps exchanging this knowledge and the corresponding updates
with its peers; hence, each CBG has complete information about the routes to all
replicated content throughout the Internet. To reduce the load of CIGs, the CBG
does not propagate content routing information it learns from the Internet into the
CIG under most circumstances. Occasional communication initiated by the CBG to
the CIG is detailed in section 3.2.4.
To summarize, our proposed content internetworking architecture consists of
the following set of protocols:
1. Content Gateway Interconnection Protocol (CGIP): dealing with interaction
between CIGs.
2. Content Gateway and Border Gateway Interconnection Protocol (CGBP): dealing with interaction between CIG and CBG.
3. Content Border Gateway Protocol (CBGP): dealing with interaction between
CBGs.
Subsystem Descriptions
In this section, we describe the functions of each sub-system shown in Figure 3.1.
CN Internal Content Request-Routing System
A CN's internal RR system is responsible for delivering client requests to the "nearest"
surrogate located in that CN [3]. DNS and URL redirection based request
routings are the two most commonly accepted CN internal content routing techniques.
In general, content internetworking views CNs as black boxes, allowing the
CN supplier the maximum flexibility in choosing its own internal content routing.
Content Gateway Interconnection Protocol (CGIP)
Initially, each CIG has only the knowledge of routing status for content residing on
its own CN, learned through routing updates from the CN's internal request routing
protocol. If there is more than one CN in an AS, a CIG can, by establishing peering
relationships with neighboring CIGs in its local AS, exchange routing updates with
others so that all CIGs in an AS share a consistent view of content routing states for
all content residing on that AS. A CIG can make content routing decisions based
on its local policies and a set of metrics exchanged with peers.
CIGs also need to communicate with CBGs. Since all CIGs in the same
AS can synchronize with each other, and since all have global content information
inside the local AS, it is not necessary for all the CIGs to communicate with all
CBGs. System administrators choose only one CIG to represent all CIGs in that
AS to communicate with a certain CBG for content routing exchanges, in order
to minimize the traffic load. The representative can be manually specified in the
configuration. For fault tolerance, besides the primary representative,
a secondary representative can be configured. However, this static method is not
sufficient. A better mechanism lets CIGs communicate with each other to negotiate
and dynamically elect the CIG that best serves as a representative in terms of CIG
load and network proximity to the CBG.
Content Gateway and Border Gateway Interconnection Protocol (CGBP)
In order to let the external world of an AS know about the content information
inside a local AS, the CIG in the local AS needs to communicate with the CBG. We
use CGBP to handle the communications between the CIG and the CBG. The CIG
acts as a special peer of the CBG in its AS, and exchanges content routing updates
with the CBG in an asymmetric way. The CIG propagates all routing updates to
the CBG; however, the CBG does not flood the CIG with all the content routing
information it learns, much of which may never be useful for clients inside a CN. The few
situations where the CBG takes the initiative to propagate routing updates to the
CIG are discussed in the following paragraphs.
For content routing decisions in a single AS, usually, a good surrogate which
exists in the local AS is considered the best candidate to serve a client's requests
originating within that AS. However, there may be exceptional situations where
a remote surrogate is better than local surrogates, for example, when there is an
extremely heavy load on local surrogates. The CBG has the whole picture of both
global and local replicated content routing information; therefore, it is the CBG
that decides, for particular content with replicas in both the local and remote ASs,
which surrogate is best. If a routing decision is made that the best surrogate has
changed from a local surrogate to a remote surrogate, or from a remote to a local, the
CBG advertises this route to the CIG representative, which propagates this update
further to other CIGs within that AS. If this situation happens too often, then it is
the local CN provider's responsibility to enhance its surrogates' capabilities.
Another situation that can cause reverse communication from the CBG to
the CIG representative arises when the CBG finds there is an explicit content route
withdrawal from its route table. This explicit withdrawal indicates that a content
server is not available for service for some reason, and this update must be passed
to the CIG immediately.
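The two situations in which the CBG initiates communication toward the CIG can be summarized in a small predicate. This is a sketch under stated assumptions rather than part of the protocol definition: `ContentRoute`, its `origin_as` field, and `should_notify_cig` are hypothetical names, and returning `False` when there is no previous or no new best route is our simplification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ContentRoute:
    """Hypothetical record: a content name served from a server in some AS."""
    name: str
    server_ip: str
    origin_as: int


def should_notify_cig(local_as: int,
                      old_best: Optional[ContentRoute],
                      new_best: Optional[ContentRoute],
                      explicit_withdrawal: bool = False) -> bool:
    """Return True when the CBG should push an update to the CIG representative."""
    # Situation 2: an explicit content route withdrawal is always passed on.
    if explicit_withdrawal:
        return True
    # Assumption: with no previous or no new best route there is nothing to
    # compare, so the CBG stays silent and avoids flooding the CIG.
    if old_best is None or new_best is None:
        return False
    # Situation 1: the best surrogate moved between the local AS and a remote AS.
    was_local = old_best.origin_as == local_as
    is_local = new_best.origin_as == local_as
    return was_local != is_local
```

A change from one remote surrogate to another remote surrogate returns `False`, which matches the asymmetric exchange described above.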
Content Border Gateway Protocol (CBGP)
CBGP interconnects ASs for routing exchanges. CBGP is a superset of BGP. It
supports BGP for traditional IP routing, as well as content routing for content
internetworking purposes. While the address mechanism for the IP routing part of
CBGP remains the same as in the original BGP, the content routing part adopts a
dual addressing scheme: Name+IP.
Initially, each CBG has knowledge of content available only in its own AS,
learned through CGBP routing updates from a CIG representative (see 3.2.4). By
establishing peer relationships with neighboring ASs, it then propagates this
knowledge and updates to its CBGP peers. When a CBG receives routing updates from a
CBGP peer, if necessary, it updates its Routing Information Base (RIB) and further
propagates routing updates to other CBGP peers. In this way, a CBG receives the
knowledge of the content routing of the replicated content in other ASs.
In this thesis, unless specified, we use CBGP to refer only to the content
routing part of the complete CBGP protocol. The details of CBGP are covered in
Chapter 4.
Route Caching at CIG
In our architecture, fundamentally, a CIG has routing information only for content
available in the local AS. However, the CIG can send a query to the CBG for routing
information of particular content residing on foreign ASs to resolve client requests.
Given the high similarity in access patterns within a group of users, one of the
most commonly accepted mechanisms for improving performance is caching, which
can be applied in this content routing scenario. Unlike most known web caching
schemes that cache the content itself, our cache scheme caches routes for content.
When a CIG receives a content response from a CBG with an IP address for a name
request, the CIG caches the routes for future use. Thus, if a client requests content
that has a replica only in a remote AS, and if the CIG finds a cached route for
that content, the name request can be resolved in that CIG locally. Unless there
is a cache miss in the CIG, the CBG is not involved in name resolution, which
greatly speeds up the resolution process. It also reduces the overall demand for the
involvement of CBGs, which could otherwise give rise to a potential performance bottleneck.
On the other hand, caching can introduce the possibility of "out-of-date" routes.
An appropriately set TTL value may compensate for this shortcoming, at the cost of
extra traffic for cache refreshing. From another point of view, the fact
that those cached routes are still available for content service (i.e. those servers are
not down) makes this problem appear less serious.
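A route cache with a TTL, as discussed above, can be sketched as follows. This is an illustrative sketch only: `RouteCache`, its method names, and the injectable `clock` parameter are hypothetical, and the content name and addresses used with it are placeholders.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple


class RouteCache:
    """Caches routes (server addresses) per content name, not the content itself."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so that expiry can be exercised in tests
        self._entries: Dict[str, Tuple[float, List[str]]] = {}

    def put(self, name: str, server_ips: List[str]) -> None:
        """Store the routes returned by the CBG together with their expiry time."""
        self._entries[name] = (self._clock() + self._ttl, list(server_ips))

    def get(self, name: str) -> Optional[List[str]]:
        """Return cached routes, or None on a miss or after the TTL has elapsed."""
        entry = self._entries.get(name)
        if entry is None:
            return None
        expires_at, server_ips = entry
        if self._clock() >= expires_at:
            del self._entries[name]  # expired: force a fresh query to the CBG
            return None
        return server_ips
```

A shorter TTL lowers the chance of serving a stale route but increases the number of queries sent to the CBG, which is the trade-off described above.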
Content Request Resolution Process
Our hierarchical system architecture enables a hierarchical request resolution process,
which gives the system good scalability. In addition, the hierarchical name resolution
approach completely eliminates the globally centralized authoritative system
from the resolution chain, which significantly improves service performance, system
availability and reliability. Resolution happens at three levels, from lowest to highest:
the CN internal RR system level, the CN level, and the AS level. Each level
tries its best, with the knowledge of content status it has, to resolve name requests
submitted by clients or from lower levels. Only when a request cannot be resolved
at a given level does the request escalate to the next level; in a case where even the
highest level fails to resolve a request, it is redirected to the normal DNS chain for
resolution.
A client's name request originating within a CN follows the resolution path
described in the steps below:
1. CN Internal RR System Level - CN Internal Resolution
If the client's request can be resolved by a CN's internal RR system, which
means the requested content has either a replica or an origin (depending on
the CN's internal RR scheme) located within the local CN, the resolution
process stops. Otherwise, the request is forwarded by the internal RR system
to the pre-configured CIG, and goes to 2.
2. CIG Level
2.1 CIG Resolution
If the request can be resolved by the CIG (usually, the CIG replaces whatever
DNS server originally exists in the CN), which means the requested content
has either an origin or a replica located within its local AS, or if the
CIG finds cached routes for requested content that is located only in
foreign ASs, it responds with the address of the "nearest" server, using the
load balancing method described in step 2.2.2. That can be either the origin
server or a surrogate server for the requested content. Otherwise, it goes to
step 2.2.
2.2 Content Query from CIG and Content Response by CBG
2.2.1 The CIG sends a content query for the requested content to a connected
CBG and waits for a response from the CBG. Then it goes to step 3.
2.2.2 Upon receiving a content response from a connected CBG, the CIG
caches the response, and if the response contains one content server's address,
the CIG responds to the CN internal request routing system with the address
of that server. When there are multiple servers' addresses in the response,
the CIG chooses one of the servers based on a load balancing algorithm. This
algorithm randomly chooses one of the servers, but it uses a weighted random
number generator to select the servers with the probability corresponding to
the metric of each server. If no route is available from the CBG, the CIG
forwards the client request further to the normal DNS request resolution chain.
3. CBG Level Resolution
Upon receiving a content query from a CIG, the CBG checks its Routing
Information Base (RIB) for the requested content. If routes are found in the RIB,
meaning the requested content has replicas in foreign ASs, the CBG responds
to the CIG with all the "best" routes stored. Otherwise, the CBG responds
indicating that the requested content is not replicated in any foreign AS. The
process then goes to step 2.2.2.
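The weighted random choice in step 2.2.2 can be sketched with the standard library. The sketch assumes that a larger metric means a more capable server, so the metric is used directly as the selection weight; if the metric instead expressed load, the weights would have to be inverted. The function name `choose_server` and the `(address, metric)` pair representation are hypothetical.

```python
import random
from typing import Optional, Sequence, Tuple


def choose_server(candidates: Sequence[Tuple[str, int]],
                  rng: Optional[random.Random] = None) -> str:
    """Pick one server address with probability proportional to its metric.

    `candidates` holds (address, metric) pairs taken from a content response.
    Assumption: a higher metric marks a more desirable server.
    """
    if not candidates:
        raise ValueError("no candidate servers in the content response")
    rng = rng or random.Random()
    addresses = [address for address, _ in candidates]
    weights = [metric for _, metric in candidates]
    # random.choices draws with replacement according to the relative weights.
    return rng.choices(addresses, weights=weights, k=1)[0]
```

Because the choice is random rather than always picking the single best server, repeated requests spread across all returned servers, which is what keeps any one server from being overloaded.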
The distinction in the handling of replicated content and non-replicated
content is an important feature of our approach and an important strategy for
achieving good scalability. The CIG has knowledge of non-replicated content but
only propagates replicated content to the CBG. When a client requests content that
is not replicated, if the content's origin server resides on the local AS, the request is
resolved by the CIG directly; if the content's origin server resides on a remote AS,
and there is no cached route for that content, the request is finally redirected to the
normal DNS chain for resolution.
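The escalation through the three levels, ending in the normal DNS chain, can be sketched as a single function. This is a simplified illustration under stated assumptions: the lookup tables are plain dictionaries, `query_cbg` and `dns_fallback` are hypothetical callables standing in for the CBG query and the DNS chain, and only non-empty CBG responses are cached.

```python
from typing import Callable, Dict, List, Tuple

Routes = List[str]  # candidate server addresses for one content name


def resolve(name: str,
            cn_table: Dict[str, Routes],
            cig_local: Dict[str, Routes],
            cig_cache: Dict[str, Routes],
            query_cbg: Callable[[str], Routes],
            dns_fallback: Callable[[str], Routes]) -> Tuple[str, Routes]:
    """Return (level that resolved the request, candidate server addresses)."""
    # Level 1: the CN internal RR system knows content inside the local CN.
    if name in cn_table:
        return "CN", cn_table[name]
    # Level 2.1: the CIG knows all content (replicated or not) in the local AS ...
    if name in cig_local:
        return "CIG", cig_local[name]
    # ... and may hold cached routes to content located only in foreign ASs.
    if name in cig_cache:
        return "CIG-cache", cig_cache[name]
    # Levels 2.2 and 3: ask the CBG, which only knows replicated content.
    routes = query_cbg(name)
    if routes:
        cig_cache[name] = routes  # assumption: only positive answers are cached
        return "CBG", routes
    # Non-replicated content in a remote AS falls through to the normal DNS chain.
    return "DNS", dns_fallback(name)
```

The extra cost for non-replicated remote content is visible in the last two steps: one query to the CBG that returns nothing, followed by the ordinary DNS lookup.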
Compared to traditional DNS resolution approaches, ours significantly reduces
the response time for replicated content, which is the most widely accessed
content on the Internet. However, there is a small penalty imposed on requests for
content without a replica and with the origin server located in remote ASs, for which
there are no cached routes. This extra overhead, however, is only one round trip
from the CIG to the CBG located in the same AS, which is acceptable. Additionally,
the reason a particular piece of content is not replicated across the Internet is very
likely to be that it is unpopular, and therefore, seldom requested or accessed. Thus,
the small additional overhead does not noticeably degrade overall service quality.
Further, by populating the CBG with the routing states for only replicated content,
we significantly reduce the size of its routing table, compared with the pure name
based routing approach [11].
While content RR within a CN is handled by the proprietary schemes of
different vendors, routing processes for CI are our research focus in this thesis. In
our hierarchical architecture, the CBGP deals with the interconnection of content
in different ASs, while the CGIP deals with the interconnection of CIGs in the same
AS. As the CGIP is described in other works [35], our discussion focuses only on
CBGP in this thesis. We give its design details and discuss design issues in Chapter 4.
Chapter 4
Design of CBGP
The goal of content internetworking can be briefly described thus: by interconnecting
the CNs around the world, we can find a good surrogate server globally that can
serve client requests anywhere on the Internet with good quality. "Good" can be
measured in different metrics, such as server response time, server-client latency and
throughput, server health and load, and so forth. Clearly, network proximity
between clients and servers is an important factor, even the only factor considered
in some solutions for the content route selection process. This factor is naturally
a built-in feature of the physical path from server to client. The physical path
information is well maintained by routers along the path; therefore, it is natural to
have routers involved in the content routing process to provide real time network
accessibility information. This network-integrated content internetworking approach
saves the application layers from doing the proximity measurements, for example,
probing with "ping", thereby significantly reducing network traffic. It can also solve
the problem that arises when servers are behind a firewall or other Network Address
Translation (NAT) device, where probing is prohibited. The motivation behind the
decision to delegate content internetworking from the pure application layer to the
network layer is the desire to fully exploit the existing network layer's functions and
resources. From the scalability point of view, the interdomain routing system using
BGP is regarded as one of the most successful, and the largest, distributed systems,
since it enables the working of the whole Internet. Our content internetworking
approach shares similar features.
Depending on the granularity of the replicated content, we may have first
level content names such as "," or second level content names, such as
"". Since disk storage is not a scarce resource nowadays, most
content names should only be located at the first level.
Content routing, which attempts to find a short route to a replicated piece
of content, is very similar in nature to a normal IP anycast that tries to find the
"shortest" route based on measurements such as network hops. In content routing,
all replicas of a piece of content share the same Anycast Content Name (ACN).
Further, this ACN appears in the RIB, representing a virtual content destination
node on the network, which is shown in Figure 4.1.
We route content in a way similar to IP routing, except that, instead of
trying to find the best route to a destination represented by an IP prefix, the ACN,
representing several content replicas, indicates the content destination address. This
is discussed in detail later.
As shown in the general system architecture, we interconnect the content of
CNs in different ASs through interdomain routing, and in particular, we modify
the Border Gateway Protocol (BGP), which forms the CBGP, to enable the content
internetworking function. In implementation, the newest version of BGP is used: BGP4.
Features of CBGP
We will now discuss the features of the CBGP design.
1. Name+IP addressing mechanism
As a content destination is represented by a content name, theoretically, we can
route content using purely name-based routing globally, just as suggested in
[11]. That means the IP address in traditional IP routing is completely replaced
by the ACN. However, this requires a change in the transport layer
protocol, and correspondingly, all the client side networking software, since
all network connection establishment and data transmission on the Internet
currently use IP addresses instead of names. Practically, we cannot
use pure name-based routing, but we rely on the routing process to find a
good route to content by directly pointing at a surrogate address for access;
hence we do not have a Next-Hop attribute for the content routing table, as
the normal network routing table does. Therefore, we call our content RIB a
"content route table" instead of a "routing table".
We adopt a combination of the Name and the IP addressing mechanisms to
represent replicated content, which means a piece of content is represented by
its name, plus the IP address of the server on which the content resides. For
example, if a piece of content has a replica on a given server, that specific piece
of replicated content is identified by the tuple formed from the content name and
the server's IP address, written (ACN, IP). When content
represented as a tuple (ACN, IP) has changes in its metric, the updates are
sent in content routing update packets, and the CBG makes a decision by
comparing the content routes.
Figure 4.1 illustrates an example of the CBGP routing process.
Figure 4.1: CBGP routing process (legend: virtual content destination node; pointer to a virtual content destination node from a content server; indirect connection passing through networks; direct connection. Route entries are labeled with a metric, an AS_path, and an NH.)
2. Load Balance Considerations
The content route table of a CBG maintains the "best" route selected by
the routing decision process to all replicated content by exchanging content
routing information with CBGP peers. For load balancing and fault tolerance
purposes, multiple routes to each piece of content may be stored in order of
goodness from the "best" route, to the secondary route, to the third route,
and so on. The number of spare routes the content routing system actually
keeps for a particular piece of content is an implementation trade-off between
system flexibility and space overhead.
3. Server Side Metric Participating in the Routing
As the routing process aims at locating a server that can serve content requests
well, we need to accommodate server side metrics in overall routing decisions
in addition to network conditions. Thus in our approach, the server side metric
directly participates in content routing.
4. Fully Distributed
Content internetworking through CBGP has the same nature as IP routing on
the Internet: it is fully distributed without any centralized point.
5. Content Query Resolution Function
As well as performing normal content update processing, the CBG also
functions as a content query resolution server by answering content queries
from CIG(s).
Overview of Content Announcement and Withdrawal
In this section, we introduce the basic idea of how to modify BGP to accommodate
content routing.
Content Update Message Format
The CBGP shares many protocol features with BGP, such as connection opening
and periodic probing. Thus we integrate the CBGP with BGP for efficiency and
easy deployment.
The CBGP adopts message types similar to those of BGP4: OPEN,
KEEPALIVE, UPDATE, and NOTIFICATION. CBGP's OPEN, KEEPALIVE and
NOTIFICATION message formats are the same as BGP4's. As the UPDATE
message intrinsically carries content routing information, it embeds mechanisms
specifically for content routing; this is done by extending the path attributes.
In BGP, an UPDATE message is used to advertise a single feasible route to a
peer, or to withdraw multiple unfeasible routes from service. As name based routing
differs from IP based routing in route attributes, we add an extra path attribute
called "CONTENT" to represent content routing information.
Each path attribute in the UPDATE message is a triple <attribute type,
attribute length, attribute value> of variable length. There is a unique attribute
type code in the attribute type to identify each individual attribute.
The CONTENT attribute has the following value:
- Attribute length: specifies the total length of this type of attribute
- Attribute type: Type Code 20; the attribute flag is set to define this attribute
as optional transitive
- Attribute value:
• Content Update Type: 1 bit ("0" - content withdrawal, "1" - content announcement)
• Host Sub-addr: variable length; specifies the host part of the IP address of
the content server. The actual length of this field can be calculated from
the length of the Network Prefix field in the Network Layer Reachability Information
(NLRI), which is a key field in each UPDATE message.
• Content Identifier Length: specifies the length of the Content Identifier field
• Content Identifier: variable length; specifies the name of the content
• Server-Side Metric: specifies the processing capability of the associated content
server
• Valid Time Period: how long this route can be thought of as valid
• Optional Field: a field for future function extension
Content Update Type (1 bit)
Host Sub-addr (variable)
Content Identifier Length (8 bits)
Content Identifier (variable)
Server-side Metric (8 bits)
Valid Time Period (16 bits)
Optional Field (8 bits)
Figure 4.2: CONTENT attribute format
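To make the layout in Figure 4.2 concrete, the following sketch encodes and decodes one entry of the CONTENT attribute value. It is an illustration under several assumptions that the format above does not fix: the 1-bit Content Update Type is stored in a whole byte, the Host Sub-addr is padded to whole bytes, the Optional Field is written as zero, and a withdrawal omits the Optional Field together with the Server-Side Metric and Valid Time Period. `ContentEntry`, `encode_entry`, and `decode_entry` are hypothetical names.

```python
import struct
from typing import NamedTuple, Optional, Tuple


class ContentEntry(NamedTuple):
    announce: bool                    # True = announcement, False = withdrawal
    host_subaddr: int                 # host part of the content server address
    identifier: str                   # content name; empty = all content on server
    metric: Optional[int] = None      # 8-bit server-side metric
    valid_time: Optional[int] = None  # 16-bit valid time period


def encode_entry(entry: ContentEntry, host_bits: int) -> bytes:
    """host_bits is derived from the NLRI, e.g. 32 minus the prefix length."""
    host_len = (host_bits + 7) // 8   # assumption: pad to whole bytes
    ident = entry.identifier.encode("ascii")
    out = bytes([1 if entry.announce else 0])  # assumption: type uses a full byte
    out += entry.host_subaddr.to_bytes(host_len, "big")
    out += bytes([len(ident)]) + ident         # 8-bit length, then the name
    if entry.announce:
        # metric (8 bits), valid time (16 bits), optional field (8 bits, zero)
        out += struct.pack("!BHB", entry.metric, entry.valid_time, 0)
    return out


def decode_entry(data: bytes, host_bits: int) -> Tuple[ContentEntry, int]:
    """Return the decoded entry and the number of bytes consumed."""
    host_len = (host_bits + 7) // 8
    announce = data[0] == 1
    pos = 1
    host = int.from_bytes(data[pos:pos + host_len], "big")
    pos += host_len
    ident_len = data[pos]
    pos += 1
    ident = data[pos:pos + ident_len].decode("ascii")
    pos += ident_len
    metric = valid_time = None
    if announce:
        metric, valid_time, _optional = struct.unpack_from("!BHB", data, pos)
        pos += 4
    return ContentEntry(announce, host, ident, metric, valid_time), pos
```

Because `decode_entry` returns the number of bytes consumed, several entries packed back to back in one attribute value can be read in a loop.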
We use the field "Valid Time Period" mainly to account for the "health" of
the server. The CBG has no way of knowing whether a specific server is frequently down,
whereas the server itself keeps the relevant statistics. This information is propagated
to the CIG, which in turn sends it to the CBG. Therefore we keep a time period to represent
the effective time of metrics and routes. Using "Valid Time Period" instead of the
absolute time of expiration accommodates the time difference between systems if they
are not synchronized.
BGP is a path vector protocol; as such, if there exists a loop in the path, the
update packet is discarded. Therefore, we do not need to make an extra effort to
prevent loops of routing advertisement packets, since we inherit BGP routing.
In CBGP, one single UPDATE message can announce the availability of multiple
content routes, and withdraw the previously announced availability of multiple content routes,
as long as the servers for the content share the same Network Layer Reachability
Information. The announcement and withdrawal updates are not necessarily put in
order in the UPDATE message. For content withdrawal, no Server-Side Metric and
Valid Time Period fields are present. If the Content Identifier Length is set to 0,
all content on that content server is to be withdrawn, and no Content Identifier
field follows.
Here is a sample scenario where a CBG sends an update: the CBG propagates content announcements for (with, server side metric 36), (with, server side metric 129), (with, server side metric 825), and makes content withdrawal for
(, and all contents on . The three content advertisements
and two withdrawals can be aggregated into one UPDATE message as follows:
Network Layer Reachability Information:
Attribute Type Code: 20
Attribute Length: 58 bytes
Attribute Value:
* Content Update Type: 1
IP Subnet: 26/8
Content Identifier Length: 9
Content Identifier:
Server Metric: 36
* Content Update Type: 1
IP Subnet: 5/8
Content Identifier Length: 10
Content Identifier:
Server Load: 129
* Content Update Type: 1
IP Subnet: 108/8
Content Identifier Length: 7
Content Identifier:
Server Load: 825
* Content Update Type: 0
IP Subnet: 16/8
Content Identifier Length: 10
Content Identifier:
* Content Update Type: 0
IP Subnet: 31/8
Content Identifier Length: 0
Processing of the Content Update
When a CBG receives a content update message, its actions are as outlined below:
• Upon receiving a content announcement:
if there is no existing entry for the announced content in its RIB, it adds
that content to the RIB with the associated advertised metric; otherwise,
if a route for the announced content already exists, the CBG compares
(in a routing decision process covered in later sections) the newly announced
route with the existing routes. If the routing decision process ranks the new
route above any of the existing routes, the RIB is updated by replacing the
least desirable route with the announced route; otherwise, the CBG
ignores the content advertisement.
• Upon receiving a content withdrawal:
it deletes the related content entries associated with that specific server address
from its RIB, then recalculates the routes to the content and sends
the withdrawal to the peering CIG if the withdrawal affects the content
route table.
• Upon receiving an IP route withdrawal that withdraws an IP prefix's availability:
it deletes all content entries associated with that IP prefix from its
RIB, then recalculates the routes to the content and sends
the withdrawal to the CIG if the withdrawal affects the content route table.
The entries in the content route table take the format (Content-Name,
(IP_1, Metric_1, Period_1), (IP_2, Metric_2, Period_2), ..., (IP_n, Metric_n, Period_n)),
where n is the number of best routes the CBG maintains and Period represents the
valid time period of the route. When a change occurs in its RIB, the CBG propagates
this update to its CBGP peers.
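The announcement and withdrawal handling described above might be sketched as follows. The type and function names, the fixed table size, and the use of a single precomputed metric (lower is better) are assumptions for illustration. Refreshing an entry when the same server re-announces is also an assumption, since the text does not cover that case.

```c
#include <string.h>

#define MAX_ROUTES 2   /* n, the number of best routes kept (hypothetical) */

typedef struct {
    char ip[16];       /* content server address */
    unsigned metric;   /* route preference value, lower is better */
    unsigned period;   /* valid time period */
} content_route_t;

typedef struct {
    content_route_t routes[MAX_ROUTES];
    int count;
} content_entry_t;

/* Returns 1 if the entry changed (so peers must be updated), else 0. */
int on_content_announce(content_entry_t *e, const content_route_t *r)
{
    int i, worst = 0;

    for (i = 0; i < e->count; i++) {
        if (strcmp(e->routes[i].ip, r->ip) == 0) {  /* same server: refresh */
            e->routes[i] = *r;
            return 1;
        }
    }
    if (e->count < MAX_ROUTES) {                    /* room left: just add */
        e->routes[e->count++] = *r;
        return 1;
    }
    for (i = 1; i < e->count; i++)                  /* find least desirable */
        if (e->routes[i].metric > e->routes[worst].metric)
            worst = i;
    if (r->metric < e->routes[worst].metric) {      /* better: replace it */
        e->routes[worst] = *r;
        return 1;
    }
    return 0;                                       /* ignore advertisement */
}

/* Removes every route through the given server; returns 1 if any was removed. */
int on_content_withdraw(content_entry_t *e, const char *ip)
{
    int i, kept = 0, changed = 0;

    for (i = 0; i < e->count; i++) {
        if (strcmp(e->routes[i].ip, ip) == 0)
            changed = 1;
        else
            e->routes[kept++] = e->routes[i];
    }
    e->count = kept;
    return changed;
}
```

The return value stands in for the "RIB changed" condition that triggers propagation to CBGP peers or the peering CIG.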
To achieve scalability and accommodate the diversity of autonomous
systems, BGP cannot rely on accurate network proximity information when making
routing decisions; instead, it uses predefined policies to calculate route preference.
Generally, these policies reflect the wisdom of choosing the most suitable (though
not necessarily always "best") route for the current AS. In addition, some recent
studies [27,31] show that the AS hop count of a path is a decent indicator of the path's
proximity, reliability, and stability. To simplify things and avoid heavy modification
of BGP, we decided to adopt the path preference calculation strategy of BGP and
to adopt a reasonable way of making comprehensive content routing decisions. A
preliminary solution is described in Section 4.4. For a more accurate solution, we
discuss another design using refined metrics in Section 4.5.
A Preliminary Solution for Content Routing Decision
We give a preliminary method to calculate content route preference, and describe
the basic routing decision process in this section.
Content Route Preference Calculation
The CBG may receive multiple routes for particular content via its CBGP peers'
advertisements, originating from the same or from different content servers. These
routes contain both network accessibility information and a server-side metric. When
considering the network level measurement factor, we adopt a method similar to the
one used in BGP4 for calculating the degree of route preference, in order to achieve
compatibility and share computation processes. The difference is that instead of solely
comparing routes to the destination of the same IP prefix, multiple IP addresses
(to be accurate, multiple NLRIs) of those content servers are used to compare the
routes, since the concept of "destination" now refers to a piece of content with
many replicas. The BGP local policies also apply to CBGP, for example, whether
the current AS is willing to act as transit for the content in certain other ASs.
For content routes originating from different content servers, if a specific
route is superior to the others in terms of both network level path preference and
the server side metric, the routing decision can easily be made. But if no route has
such a clear advantage in both factors, comprehensive consideration of the network
level factors and the server side metric is needed.
If two routes have unbalanced advantages on either the network level or the
server side, we define a function that takes the attributes of a given route as its
arguments and returns a value denoting the degree of preference for the content route.
For the time being, we consider all the main attributes of a route, which include
LOCAL_PREF, AS_PATH and CONTENT, and define the overall preference
of a route as follows:
$$\mathrm{Pref} = \mathrm{scale}(\gamma \cdot \mathrm{Server\_Metric}) + \mathrm{scale}(\beta \cdot \mathrm{AS\_path}) + \alpha \cdot \mathrm{Loc\_Pref}$$
The values of the three route attributes LOCAL_PREF, AS_PATH length and
Server_Metric could be measured on different scales, which makes them incomparable.
Therefore, we need to re-scale these values to the same magnitude to keep
the comparison consistent. We scale the AS path length and server side metric
to the same order as the largest value of local preference, since the router knows
the value of local preference for each of its interfaces.
The coefficients are currently defined as follows. We set $\gamma = 0.5$, as we consider
the server side metric to have almost the same weight as the network level, according
to the measurement in [25], which reports a correlation of 0.51 between ping
round trip time and HTTP request response time. We set $\beta = 0.4$, as the AS
path length is an indicator of network level proximity at the BGP level. Finally,
$\alpha = 0.1$, as local preference is also a non-negligible factor in making routing
decisions in BGP. These values are only our initial settings based on simple heuristics; a
system administrator can adjust these parameters in response to real situations, including
using different sets of coefficients for different categories of routes. We also
need to preprocess the value of local preference, as the router usually
prefers the route with the largest value, which is the opposite of AS path length and
server side metric (the lower the value is, the more preferable the route is).
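A minimal sketch of the preference calculation follows, using the coefficients 0.5, 0.4 and 0.1 given above. The linear rescaling against a known maximum and the inversion of local preference as "maximum minus value" are assumptions of this sketch; the text only requires that the three terms share one magnitude and one direction (lower is better). All type and function names are hypothetical.

```c
typedef struct {
    double server_metric;   /* lower is better */
    double as_path_len;     /* lower is better */
    double local_pref;      /* higher is better in BGP */
} route_attrs_t;

typedef struct {
    double max_server_metric;
    double max_as_path_len;
    double max_local_pref;  /* the common scale all terms are mapped to */
} scale_ctx_t;

/* Linear rescaling of v from [0, vmax] to [0, target_max] (an assumption). */
double rescale(double v, double vmax, double target_max)
{
    return vmax > 0.0 ? v * target_max / vmax : 0.0;
}

/* Overall preference of a content route; the lowest value wins. */
double content_route_pref(const route_attrs_t *r, const scale_ctx_t *s)
{
    const double w_server = 0.5, w_aspath = 0.4, w_locpref = 0.1;
    /* Invert local preference so that lower is better, like the others. */
    double inv_local_pref = s->max_local_pref - r->local_pref;

    return rescale(w_server * r->server_metric,
                   s->max_server_metric, s->max_local_pref)
         + rescale(w_aspath * r->as_path_len,
                   s->max_as_path_len, s->max_local_pref)
         + w_locpref * inv_local_pref;
}
```

With maxima of 100 for the server metric and local preference and 10 for the AS path length, a route with server metric 10, AS path length 5 and local preference 50 scores 30, which beats a route with server metric 50, AS path length 2 and local preference 100, which scores 33.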
There are two parts to the AS_PATH attribute: AS_SET (the unordered set of
ASs a route in the UPDATE message has traversed) and AS_SEQUENCE (the ordered set
of ASs a route in the UPDATE message has traversed, taking aggregation
into account). We calculate the actual AS path length using the following formula:
$$\mathrm{AS\_Path\_Len} = \mathrm{length}(\mathrm{AS\_SEQUENCE}) + \log_2\bigl(\mathrm{length}(\mathrm{AS\_SET}) - \mathrm{length}(\mathrm{AS\_SEQUENCE})\bigr)$$
Our rationale is that each time several routes originating from
different ASs are aggregated into one route, the AS numbers of those independent
routes are eliminated from the AS_SEQUENCE and, instead, the local system that
performs the aggregation prepends its own AS number to the AS_SEQUENCE. All the
AS numbers still appear in the AS_SET. The aggregation process is quite similar
to (though, in the real Internet, not exactly like) a tree structure, so the logarithm
approximates the height of the tree that the route traverses in addition to the current
AS_SEQUENCE length.
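As a worked instance of the formula, with purely hypothetical lengths, consider an aggregated route whose AS_SEQUENCE holds 3 ASs and whose AS_SET holds 11:

$$\mathrm{AS\_Path\_Len} = 3 + \log_2(11 - 3) = 3 + 3 = 6$$

The eight ASs hidden by aggregation thus add three hops, the height of a balanced binary aggregation tree with eight leaves, rather than eight.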
We can also define a local policy with a re-configurable threshold T on the
server side metric for filtering out those routes that have unacceptable server response times.
Content Routing Decision Process
The decision process followed by CBGP to select a preferred route to specific
content from multiple candidates is described below:
1. Filter out the unacceptable routes using threshold T
2. If a route update specifies a next hop that is inaccessible, the route is dropped
3. Calculate the preference of each content route using the above formula; the
one with the lowest value wins
4. If a tie results from step 3, the one with the better server-side metric is selected
(since the server-side metric is a more accurate measurement than network
preference as calculated using the routing strategy of BGP)
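The four steps above can be sketched as a single pass over the candidate routes. The structure and function names are hypothetical, the preference value is assumed to have been computed beforehand, and "better server-side metric" is taken to mean the lower value, in line with the convention stated earlier.

```c
#include <stddef.h>

typedef struct {
    double pref;             /* precomputed route preference, lower wins */
    unsigned server_metric;  /* server-side metric, lower is better */
    int next_hop_reachable;  /* non-zero if the next hop is accessible */
} candidate_t;

/* Returns the index of the selected route, or -1 if none is acceptable. */
int select_content_route(const candidate_t *c, size_t n, unsigned threshold)
{
    int best = -1;
    size_t i;

    for (i = 0; i < n; i++) {
        if (c[i].server_metric > threshold)   /* step 1: threshold T */
            continue;
        if (!c[i].next_hop_reachable)         /* step 2: unreachable hop */
            continue;
        if (best < 0
            || c[i].pref < c[best].pref       /* step 3: lowest value wins */
            || (c[i].pref == c[best].pref     /* step 4: tie-break */
                && c[i].server_metric < c[best].server_metric))
            best = (int)i;
    }
    return best;
}
```

A route that fails step 1 or step 2 never reaches the comparison, so a very low preference value cannot rescue an overloaded or unreachable server.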
In CBGP, route aggregation in the process of route updating is possible
through integration with BGP4 and by sharing the IP prefix of the content destination.
We have already shown an example in the scenario depicted in Section 4.3.1.
Design with a Refined Metric for Network Conditions
As is widely believed, AS path length is too rough a measurement
of network proximity, so in this section we give another design that measures
network conditions more accurately.
Path Latency Measurement
In BGP, peers constantly exchange KEEPALIVE packets to probe each
other and maintain knowledge of currently available neighbors. In most cases,
when peers are alive, this is a waste of bandwidth and computation resources. We can
utilize these packets to further measure the packet latency along the routing
path. This is facilitated by the router's I/O architecture.
A typical router's architecture is shown in Figure 4.3.
Figure 4.3: A typical router architecture (input interfaces, output interfaces)
The KEEPALIVE packet is sent out through one of the output queues, just
like any IP data packet being forwarded; most vendors do not put it in a
high-priority queue (sometimes the update packet caused by routing table changes is
sent through a high-priority queue [33]). From [34], we also find that when there
is an overwhelming quantity of data packets to process, the KEEPALIVE message
transfer can be greatly delayed or the message even lost. So, by measuring this delay,
we gain relatively accurate data regarding latency between BGP routers, which includes
propagation, queuing, and transmission delay (the latter two also reflect
the bandwidth factor). This is a fair measurement for all ASs with regard to
inter-domain routing, as opposed to the differing intra-domain routing metrics.
The way to measure this latency is to use a triggered KEEPALIVE sending
scheme: when a peer receives a KEEPALIVE packet carrying a sequence
number, it acknowledges it immediately with a KEEPALIVE message; the peer that
initiated the exchange thus obtains the round trip time for the packet.
Routing Decision Improvement without Changing IP Routing
Once the latency between the CBGs is known, we can include another route
attribute, which we call "LATENCY", in the UPDATE packet. This attribute is
also an optional transitive attribute and takes the attribute type code 21. The
attribute value is the accumulated network latency along the way, and it is similar
to the sum of the results measured by the utility "traceroute".
In this approach, IP routing still follows the original method by which
BGP works. For content routing, however, we now have a more accurate measurement
of network proximity. When a CBG has multiple paths to the content, it compares,
for each path, the sum of the accumulated network latency along the way and the
server side metric, then selects the best one. We prefer to use server response
time as the server side metric here; otherwise, we need some way to convert
other metrics accordingly. This is discussed in Chapter 6. If the update
causes content route table changes, the resulting updates are sent to its peers, and
the accumulated network latency in the UPDATE packet increases by the network
latency between those two peers.
In Figure 4.4, there are five ASs, and we assume there is no Internal BGP
(IBGP) communication inside each, as we focus on interdomain communication here.
The number on each link indicates the network latency between the peers across
that link, measured using the method described in the previous section. CS2 and
CS1 are two replicas of content A with ACN url-1, residing in AS I and AS II respectively.
In AS V, the local preferences for the routes coming from AS III, IV and VI are 50,
100 and 100 respectively. We also assume that the server side metrics for CS2 and CS1
are 10 and 6 respectively; CS2 resides on a server with the IP address Addr2, which
is attached to subnet X; CS1 resides on a server with the IP address Addr1, which
is attached to subnet Y. Under the conditions stated above, in AS V we have the
following IP routing table entries (not all the attributes of the route are listed) and
content route table entry (we omit the Valid Time Period for the route and assume
at most two routes are kept for each content):
AS-path (in AS_SEQUENCE)
IV, I (LOC_PREFERENCE & AS_PATH dominate the decision)
VI, II (LOC_PREFERENCE & AS_PATH dominate the decision)
Table 4.1: IP routing table for the sample topology
(Addr2, 17), (Addr1, 19)
Table 4.2: Content route table for the sample topology
Figure 4.4: A sample network topology
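The comparison behind the content route table can be sketched as follows. The names are hypothetical, and the latency and the server side metric are assumed to share one unit. The path latencies used in the usage note (7 and 13) are not stated in the text; they are inferred by subtracting the server side metrics 10 and 6 from the table values 17 and 19.

```c
typedef struct {
    unsigned latency;        /* accumulated LATENCY attribute value */
    unsigned server_metric;  /* server response time, same unit assumed */
} content_path_t;

/* Each CBG adds the latency of the link the update arrived on. */
unsigned latency_after_hop(unsigned accumulated, unsigned link_latency)
{
    return accumulated + link_latency;
}

/* Comprehensive metric: network latency plus server side metric. */
unsigned comprehensive_metric(const content_path_t *p)
{
    return p->latency + p->server_metric;
}

/* Returns the index of the path with the lowest comprehensive metric. */
int best_content_path(const content_path_t *paths, int n)
{
    int i, best = -1;

    for (i = 0; i < n; i++)
        if (best < 0
            || comprehensive_metric(&paths[i])
               < comprehensive_metric(&paths[best]))
            best = i;
    return best;
}
```

Feeding in `{7, 10}` for CS2 and `{13, 6}` for CS1 yields comprehensive metrics of 17 and 19, so the path to Addr2 is ranked first even though CS1 has the better server side metric.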
Clearly, the accuracy of this approach matches that of the application layer
approach. The application layer measurements follow the physical path to those
content servers calculated by BGP, and in our approach, the network latency
measurement of the paths is used to make close-to-best route selection. This method
ameliorates the disadvantage of using hops as a measurement of network proximity.
More importantly, it greatly reduces the application layer probing traffic to
servers or networks, which could be repeatedly and constantly created by many
Currently, we use the content server side metric as the trigger for content
updates. It is possible that during the interval between content updates, some
changes occur to the accumulated network latency along the path to the content
server, but at present we put this scenario aside because we believe changes
in network latency are largely caused by much heavier or lighter access to content,
and this should be reflected in the server side metric. We consider including both
triggers in future work.
BGP routers generally exchange only connectivity information,
not performance information, and in the absence of explicit policy they make
decisions by minimizing the number of independent autonomous systems traversed
on the way to the destination. This metric does not correlate very well with
performance characteristics, but that does not affect our approach above. The rationale
is that the decision of the BGP router is fair to all users: as soon as the
decision is made, the path is defined and every packet from any application program
has to follow it. However, since we can obtain measurements of network latency
along the way, we can further improve content routing by influencing the IP
routing process. We discuss this in the next section.
Routing Decision Improvement by Influencing IP Routing
Each UPDATE packet carries the "LATENCY" attribute, which records the
accumulated network latency of the path traversed, and we use this metric to find a
better path (from the network routing point of view) to reach that server. For IP
routing, when a CBG receives an update, it first still applies policy routing,
such as comparing the vendor-specific weight of routes, the local
preference of the path and so on, since we need to respect the subjective choice of
the network operator. If all the previous comparisons result in a tie, standard BGP
would next compare the length of the AS path. At this stage, instead of using the AS
path length, the CBG compares the accumulated network delay of the candidate paths
and selects the best one. For content routing, the sum of the network delay and
server response time is used as the comprehensive metric to compare the routes to
a content destination. If a better route results from either the IP or content
routing, updates are sent to its peers.
Again, referring to Figure 4.4, in AS V we have the following IP routing
table entries and content route table entry:
AS-path (in AS_SEQUENCE)
IV, VI, II
Table 4.3: IP routing table with influence of LATENCY
(Addr1, 16), (Addr2, 17)
Table 4.4: Content route table with influence of LATENCY
The AS path to network Y is (IV, VI, II) instead of (VI, II)
because the accumulated network latency dominates, since the local preferences for
AS IV and AS VI have the same value. For the content route table, considering
both the network proximity and server side metrics, we get the result shown above.
Although the above method can produce a better route for accessing the
content, we do not advocate using it, since frequent changes to IP routing
behavior could cause route instability on the Internet.
More Design Issues
We discuss more design issues in this section, including how to control content
update frequency in the Internet, how to make use of the periodic network latency
measurements, how to deal with network proximity measurement inside an AS, and
how the convergence process behaves for content routing.
Controlling the Content Update Frequency
The Internet is a huge distributed system. Considering its scale, making frequent
route changes is not acceptable for stability and scalability. Our intention is to interconnect content networks around the world and find a good replica to serve client
requests, so it is not necessary to respond to all metric changes in real time, as many
changes are transient. Thus we need to control the content update frequency.
As we know from the description in the previous section, in addition to
selecting content routes based on a normal network level accessibility factor, extra
effort must be made to take the server-side quality of service into consideration.
This can be measured in terms of server response time to client requests, server CPU
load, server health, maximum connections to the server, and so forth. To measure a
server's processing capability at a certain time, we prefer to use server response time,
as this is the metric that reflects the service quality for a request, and our purpose
in content networking is to improve content response time. As we can imagine,
the server side metric is usually quite a transient parameter, and can sometimes
change dramatically in a very short time. To smooth out the oscillation effects, this
metric is usually recorded periodically, and reflects the average value over a moving
window of time. (We refer to the size of this moving window as the measurement
interval.) When the value for the next interval changes significantly compared with
the value for the previous measurement interval, an update should be sent.
The measurement interval can be adjusted by each server.
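A minimal sketch of this trigger is given below. The function names are hypothetical, and expressing "significant" as a relative threshold on the previous interval's average is an assumption; the text does not define the criterion.

```c
#include <stddef.h>

/* Average of the samples recorded in one measurement interval. */
double window_average(const double *samples, size_t n)
{
    double sum = 0.0;
    size_t i;

    if (n == 0)
        return 0.0;
    for (i = 0; i < n; i++)
        sum += samples[i];
    return sum / (double)n;
}

/* Non-zero if the new interval's average differs from the previous one by
 * more than rel_threshold (e.g. 0.2 for 20%, a hypothetical setting). */
int significant_change(double prev_avg, double cur_avg, double rel_threshold)
{
    double diff = cur_avg > prev_avg ? cur_avg - prev_avg
                                     : prev_avg - cur_avg;

    if (prev_avg == 0.0)
        return cur_avg != 0.0;
    return diff > rel_threshold * prev_avg;
}
```

Averaging first and thresholding second means that a single slow request inside an interval does not by itself trigger a content update.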
Since the server-side metric is learned via CGBP through the CIG, which
in turn acquires this metric from the CN's internal RR system, we have an alternative
way to control the content update frequency. To keep high accuracy and real-time
response inside a CN, the server response time can be updated at a relatively high
frequency, but the CIG can do further processing to suppress frequent updates
to the CBG, using monitoring and weighted calculations of historical changes.
In our implementation, we use the technique described in the previous paragraph.
Usage of Network Latency Measurement
We calculate the latency between CBGs periodically, but we do not use every latency
value as it is measured, because updates may not be sent as frequently as the
latency is measured. Therefore, we need a way to calculate the actual
latency to use in the update packet. To calculate this value, we borrow the idea
TCP uses to compute its retransmission timer, since our purpose here
is similar. We have two variables, LT and M, representing the latency
we want to calculate and the latest measurement of the latency, respectively. To
accommodate the effect of the variance between the new measurement and the
historical value, we use a variable, D, to calculate the deviation by the following
formula:
$$D = \delta D + (1 - \delta)\,|LT - M|$$
where $\delta$ has the value 7/8. The calculation of LT is:
$$LT = M + 4D$$
Each time after an UPDATE packet is sent, LT and D are reset to start
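The two formulas can be sketched as a small estimator. The structure and function names are hypothetical, and seeding LT with the first measurement and D with zero after a reset is an assumption; the text does not specify the initial values.

```c
typedef struct {
    double lt;      /* LT: latency value to place in the UPDATE packet */
    double d;       /* D: smoothed deviation */
    int primed;     /* zero until the first measurement after a reset */
} latency_est_t;

/* Called after an UPDATE packet has been sent. */
void latency_reset(latency_est_t *e)
{
    e->lt = 0.0;
    e->d = 0.0;
    e->primed = 0;
}

/* Feed one new measurement M and return the updated LT. */
double latency_update(latency_est_t *e, double m)
{
    const double delta = 7.0 / 8.0;
    double dev;

    if (!e->primed) {            /* assumption: seed with first sample */
        e->lt = m;
        e->d = 0.0;
        e->primed = 1;
        return e->lt;
    }
    dev = e->lt > m ? e->lt - m : m - e->lt;     /* |LT - M| */
    e->d = delta * e->d + (1.0 - delta) * dev;   /* D = dD + (1-d)|LT-M| */
    e->lt = m + 4.0 * e->d;                      /* LT = M + 4D */
    return e->lt;
}
```

For measurements of 10 and then 18, the deviation becomes 1 and LT becomes 22, so a sudden jump in latency is reported with a safety margin on top of the latest measurement.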
Network Proximity Inside an AS
There is one important factor that affects content routing decisions: the delay in the
originating AS, where the content update is initiated. Because ASs differ in size,
this factor can carry a different weight in its influence on the whole route decision.
To account for this factor, we take a rough measurement of latency inside
an AS. As described in the previous sections, by measuring the latency between
border routers through KEEPALIVE packets, each border router of the Internet
Service Provider (ISP) stores the latency measured from this router to the other
border routers in its network (the IBGP routers). By averaging the stored latencies,
we arrive at the approximate latency in this AS. The content update packet then
includes this number as the initial value of the LATENCY attribute. This value can also
be used as a reference to compare whether a content source inside an AS is better or
worse than one outside the AS. This approach is scalable because the measurements
are performed locally (within an ISP's network) and the information stored at the border
routers is only on the order of the number of border routers in the ISP's network.
This idea is used in [21] as well.
Convergence Process Discussion
Since our name request resolution hinges on the Name + IP addressing mechanism
that hardwires the IP address to the specified content in the routing process, concerns arise regarding the route convergence process. We claim that content routing
in our approach does not have more serious problems with routing convergence and
oscillation than the original BGP4 [36].
If a content server is down, content requests directed to that server before the
BGP4 route converges are dropped, but this is no more serious than the convergence behavior of the original BGP4. The content withdrawal is quickly propagated through
the content routing process; CBGs remove that route from the content route
table, if it is there, and the peering CIGs are informed. In addition, the multiple paths
kept by each CBG ameliorate this problem.
If the route to certain content is in the convergence process because of the
existence of a better route while the CBG is serving a name resolution query for
that content, the performance will not be affected much. The old destination is still
reachable and can serve the content request, with only a slight performance loss
compared to the new one.
Deployment Considerations
Our content internetworking system allows an all-at-once deployment or incremental
evolution, based on service needs.
To incorporate an individual CN into the content internetworking picture,
the only requirement is to place CIGs at the edge of the CN, which is a basic requirement
to enable interconnection with other CNs. Running CGIP on a CIG enables one
CN to locate other CIG peers in the same AS, and hence to interconnect with them.
Enabling CN interconnection on a global scale requires only the upgrade
of BGP4 to support CBGP on existing Border Gateways. By using an UPDATE
packet compatible with the original BGP protocol and defining the new attributes
as transitive optional, we can make a step-by-step deployment of the new routing
architecture. The justification is that paths with unrecognized transitive optional
attributes should be accepted, as defined in the BGP standard. If a path with an
unrecognized transitive optional attribute is accepted and passed along to other
BGP peers, then the unrecognized transitive optional attribute of that path must
be passed along with the path to other BGP peers. We can therefore guarantee that attributes such as CONTENT can be transmitted across non-CBGP-enabled regions.
An original Border Gateway that only has the IP routing function ignores content attributes in routing advertisements and passes them along to BGP peers, so it can
still work compatibly with the upgraded Content Border Gateway. An incremental
deployment path greatly facilitates the extension of global content internetworking.
The deployment process can be based on user needs and an ISP's motivation
to provide a better web experience to customers and superior service to co-located
content providers. For replicated content, name resolution can be resolved quickly
by returning a good server, eliminating the need for name requests
to leave their network. For content without replicas, name resolution falls back
to normal DNS behavior. This initial deployment requires no change to end hosts.
With more and more content being replicated, the content route table may
become bigger in CBGs, which may run out of resources to store the route table and
answer queries. In that case, we can deploy an active server co-located with the
CBG to store content routes and process content requests. ISPs that already
peer at the IP routing level are motivated to peer at the content routing level to
provide their customers faster access to nearby content servers and increase the
benefit of placing content servers in their networks. As demand grows, the routers
will be upgraded gradually to adapt to content internetworking requirements.
Chapter 5
Implementation and
Implementation Based on MRT
We introduce our prototype implementation in this section. The focus is on the
implementation of CBGP. The implementations of the other protocols can be found in [35].
Implementation Overview
To test the complete system framework, we implement all the components in Figure
3.1, including a simulation of the CN itself, as we do not have an existing CN system to
experiment with. Thus, the implementation includes a server side metric updating
protocol between content servers and the CIG, CGIP, CGBP, and CBGP.
We have a monitoring agent installed on each server that records the server
performance status each minute, and sends updates to the CIG through port 10789
whenever a significant change happens.
CIGs inside an AS communicate with each other, exchanging the content
status inside each CN. Each CIG has a name resolution thread running on port
53, and the CIG replaces the normal DNS server running on the machine. One of
the CIGs is selected as the representative to communicate with the CBG through
CGBP. That CIG is configured as an internal peer of the CBG. The focus in this
chapter is on CBGP.
CBGP is implemented by extending the Multi-threaded Routing Toolkit (MRT)
[14, 15, 16, 17] under Linux to support content internetworking.
Introduction of MRT
MRT is a routing toolkit developed by the University of Michigan/Merit Network.
Besides tools such as the traffic generator and message format converter, the key
part of this toolkit is a routing daemon called MRTd. It supports RIPng, BGP4+,
multiple RIBs (route server), and RIP1/2. MRTd reads Cisco Systems-like router
configuration files and supports a Cisco Systems router-like telnet interface.
The routing architecture design of MRT incorporates features such as parallel
lightweight processes, multiple processor support, and shared memory. Although
MRT has been designed with multi-threaded, multi-processor architectures in mind,
the software will run in emulation mode on non-thread-capable operating systems.
The modular design of the software encourages the rapid addition and prototyping
of experimental routing protocols and inter-domain policy algorithms.
MRT is written in the C programming language and contains about 120,000 lines
of code, of which more than 30,000 lines deal specifically with
BGP (this excludes the code for data structures and operations shared by the whole
of MRT). Currently, due to limited time, we have only completed the MRT-based
implementation of the preliminary solution described in the previous chapter, and
we are still working on the refined solution.
Our focus is the extension of the routing daemon MRTd. In particular, the
implementation of BGP4 in MRTd is modified and extended to support content internetworking. Content query/response handling functions on the CBG are
implemented in a separate MRT thread.
Content Internetworking Implementation
• Data structures
In addition to the regular IP routing table, which is represented by a radix tree
in MRT, we need another data structure for the RIB of content routes. Currently, we
use a hash table for experimentation. A Patricia tree is also a good data structure for
this purpose, since we assume that only a low percentage of websites on the Internet
are replicated, but we have not implemented it. The hash key for the content route
table is the name of the content, and each entry in the content route table is a data
structure defined as follows:
typedef struct {
    char *cname;                // content name
    struct _one_entry *entry;   // the list of routes for this content
} croute_item_t;
typedef struct _one_entry {
    char *prefix;               // the server which holds the content
    *bhead;                     // the pointer to the bgp RIB node
    unsigned int metricN;       // metric of the network level
    unsigned int metricS;       // server side metric of the content
    long expire_time;           // the expire time for this entry
    struct _one_entry *next;    // next pointer in this link
} one_entry;
Each route to the content is inserted into a linked list in order of the calculated preference value.
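The ordered insertion can be sketched with a reduced node type. The `route_node_t` type and the `pref` field are hypothetical stand-ins for the full route entry and its calculated preference value, with the lowest value kept at the head of the list.

```c
#include <stddef.h>

typedef struct route_node {
    unsigned pref;              /* calculated preference, lower is better */
    struct route_node *next;
} route_node_t;

/* Insert node so that the list stays sorted by ascending preference value.
 * Routes with equal preference keep their arrival order. Returns the
 * (possibly new) head of the list. */
route_node_t *insert_by_pref(route_node_t *head, route_node_t *node)
{
    route_node_t **pp = &head;

    while (*pp != NULL && (*pp)->pref <= node->pref)
        pp = &(*pp)->next;
    node->next = *pp;
    *pp = node;
    return head;
}
```

Keeping the list sorted at insertion time means the preferred route for a name resolution query is always the head, with no search needed at lookup.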
The CBG maintains one data structure for each of its peers, which contains
all the information and data-storing substructures needed when communicating with
that peer. The main fields of the data structure are listed below:
typedef struct _cbgp_peer_t {
    char *name;                     /* peer name */
    prefix_t *peer_addr;            /* peer address */
    int peer_as;                    /* peer's AS number */
    u_long peer_id;                 /* peer's router id */
    int peer_port;                  /* port number in case it is not 179 */
    nexthop_t *nexthop;             /* immediate next hop */
    int sockfd;                     /* socket connected to peer */
    LINKED_LIST *ll_announce;       /* store announcement from the peer */
    LINKED_LIST *ll_withdraw;       /* store withdrawal from the peer */
    cbgp_attr_t *attr;              /* attributes of the path */
    LINKED_LIST *ll_update_out;     /* updates to be sent to peer */
    radix_tree_t *routes_in[AFI_MAX][SAFI_MAX];  /* incoming routes */
    HASH_TABLE *content;            /* content routes rib_in */
} cbgp_peer_t;
• Content route updating process
Since the packet format is changed with the addition of new attributes, the
first thing we need to do is to modify the encoding and decoding process for the
UPDATE packet. When a new peer relationship is established, all the entries in the
content route table are sent to the peer.
By processing all the route updates from individual peers, the CBG keeps
updating its view of the outside world. Each time a CBG finds a change in its
view of the IP or content routes, the changes are sent to all the other peers, except
the one that announced the new route.
The processing of the updates can be divided into the following three phases.
Only the processing of a route announcement is described below, since content
updates are not involved in the processing of a normal IP route withdrawal.
We assume there is only one prefix in the NLRI field, since processing is the same
for each.
Phase 1: In this phase, the announced route gets checked and put into the
RIB-in for that peer. The RIB-in stores route information that has been learned from
inbound UPDATE messages. The entries in it represent routes that are available as
an input to the decision process. A processing flow diagram is shown in Figure 5.1.
The Check Attributes step in the diagram verifies the path attributes to guarantee that the route has all the mandatory attributes, such as NEXT_HOP,
Figure 5.1: Phase 1 of content route updating
Phase 2: The view of the overall content routing map in the CBG is updated
according to the route change resulting from phase 1. The best routes are chosen
out of all those available for the content destination, and installed into the local
RIB. The changes to the RIB are stored in the change list. The main processing
flow is shown in Figure 5.2.
[Flow diagram steps: Policy Check; only CONTENT attribute changed?; normal BGP; calculate route preference and select best content; add content updates to change list and set flag; combine updates in change list; go to the next phase]
Figure 5.2: Phase 2 of content route updating
The Policy Check step filters the routes and manipulates path attributes to
influence the CBG's own decision process, as the local policy defines. It could filter certain
networks coming from a peer while accepting others. In case of aggregation, where the
content server address only matches a less specific route (with a shorter prefix), it is
necessary to recalculate the sub-host address in the CONTENT attribute to match
that prefix.
In this phase, if a new route is added to the content route table, the expiration
time is calculated by adding the Valid Time Period to the current system time.
Phase 3: For the changes in the list resulting from phase 2, each peer,
except the one announcing the new route, gets scanned to check whether sending the
change would contradict the predefined policy. The changes are disseminated to those
peers if the policy allows it. Meanwhile, the route withdrawals and the route changes
for content existing in the local AS are also propagated to the CIG representative.
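The peer scan of phase 3 can be sketched as below. This is only an illustration under stated assumptions: `peer_t` is a hypothetical reduction of `cbgp_peer_t`, the `policy_allows` flag stands in for whatever the local policy check returns, and a counter replaces the real `ll_update_out` list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical minimal view of a peer; cbgp_peer_t has many more fields. */
typedef struct peer {
    int  peer_as;        /* peer's AS number */
    bool policy_allows;  /* stand-in for the outcome of the policy check */
    int  queued;         /* number of changes queued for this peer */
} peer_t;

/* Queue n_changes updates on every peer except the announcer, provided
 * the policy allows it. Returns the number of peers that were updated. */
size_t disseminate(peer_t *peers, size_t n_peers,
                   const peer_t *announcer, int n_changes)
{
    assert(peers != NULL || n_peers == 0);
    size_t updated = 0;
    for (size_t i = 0; i < n_peers; i++) {
        peer_t *p = &peers[i];
        if (p == announcer)      /* never echo a route back to its source */
            continue;
        if (!p->policy_allows)   /* local policy forbids this peer */
            continue;
        p->queued += n_changes;  /* the daemon would append to ll_update_out */
        updated++;
    }
    return updated;
}
```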
• Content resolution thread
CIGs send content requests to the CBG whenever they cannot find an answer
in their own content route table, or when the entry in their cache expires, either because
the caching TTL expires or because the Valid Time for that entry expires. There
is an individual thread in the CBG which deals with such requests.
The content request/response processing between CIG and CBG is internal
to the system architecture and does not relate to the end user's request
directly, so the communication port does not have to be 53. We chose 553 as the
listening port.
Each time a content request is received, the content route table is searched,
and if the item is found, the routes stored in the linked list are retrieved one by
one and then encapsulated into one single data structure to pass back to the CIG.
In the process of route retrieval, the expiry time for each route is checked, and if it
has expired, the route is discarded. Otherwise, the new Valid Time Period is calculated
by computing the difference between the current system time and the expected expiry time.
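The retrieval step can be sketched as follows, assuming times are kept as seconds since the epoch. The types `route_t` and `answer_t`, the function names, and the choice to treat a route whose expiry equals the current time as expired are assumptions for the example, not the thesis implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, cut-down stand-in for one_entry. */
typedef struct route {
    const char   *prefix;       /* server which holds the content */
    long          expire_time;  /* absolute expiry, seconds since the epoch */
    struct route *next;
} route_t;

/* One answer record handed back to the CIG (assumed layout). */
typedef struct answer {
    const char *prefix;
    long        valid_time;     /* remaining Valid Time Period, in seconds */
} answer_t;

/* Expiry stamp for a new route: current time plus Valid Time Period. */
long route_expiry(long now, long valid_period)
{
    return now + valid_period;
}

/* Walk the list at *head, unlink expired routes, and copy up to max live
 * routes into out[]. Returns the number of answers written. Unlinked
 * nodes are not freed here; their storage stays with the caller. */
size_t collect_routes(route_t **head, long now, answer_t *out, size_t max)
{
    assert(head != NULL && out != NULL);
    size_t n = 0;
    route_t **link = head;
    while (*link != NULL) {
        route_t *r = *link;
        if (r->expire_time <= now) {   /* expired: discard the route */
            *link = r->next;
            r->next = NULL;
            continue;
        }
        if (n < max) {
            out[n].prefix = r->prefix;
            out[n].valid_time = r->expire_time - now;
            n++;
        }
        link = &r->next;
    }
    return n;
}
```

In a running daemon `now` would typically come from `time(NULL)`; it is passed in here so that the behavior can be checked with fixed values.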
• Tips in system implementation
MRTd is a routing daemon running on Linux. When it runs, no terminal
is associated with the threads, so it is difficult to debug. To make debugging
and testing convenient, the daemon process is removed from the original MRTd
program. Further, since a multi-threaded system is very hard to debug due to
constant thread switching by the process scheduler, in the system debugging stage
we set MRTd to run in emulation mode in a single process, just as if running on
non-thread-capable operating systems.
The other interesting point in the implementation is that client-side applications such as ping or Netscape on the Windows platform behave differently
from those on Linux in accepting the name resolution answer. Clients on Windows
accept answers sent from any port on the server system, while clients on Linux only
accept answers sent through the standard service port, number 53. This
problem proved difficult: we wrote the name resolution thread on Windows and
tested it there, but a direct port did not work on the Linux platform.
Topology and Configuration of the Experiment
The network topology description, environment settings and tools used in the experiment are covered in this section.
General Description
We conducted experiments to test the viability of the system architecture and the
performance of the system.
In simulating an Internet web environment, we set up Apache web servers
to hold web pages, files, etc.
As usual, the CN itself is considered a black box
for content internetworking, so our simulation ignores the CN internal Request-Routing system.
We use agents residing on the web server to collect real-time
content availability and server side metric information, and these variations are
considered to be the origins of routing updates to be sent to the CIG.
We simulated a single AS using a Linux box running CBGP. Other machines
played the role of web servers and CIG by running the corresponding software.
With a limited number of machines, however, a single machine may act
as a multi-function box with any combination of CBG, CIG and web server. The
CIG communicating with a certain CBG is configured with the same AS number as
the CBG.
The GNU web stress tool OpenLoad [18], which provides near real-time performance measurements for web applications, is used to generate traffic for system performance evaluation and analysis. For example, "openload www.myContentl.com
50" simulates 50 clients requesting and accessing content www.myContentl.com
simultaneously. The name requests are sent to our Request-Routing system for resolution; after getting the IP address of a content server, OpenLoad visits the content
Accordingly, content servers' load and network usage are affected.
metric variations are reflected in the routing system. Thus, we can observe content
routing behaviors and the load balancing performance of the content routing system.
OpenLoad is further modified to reflect the content route changes in the process of
Experiment Network Topology
We used a Bay450 T-24 Switch with VLAN capability to set up a simulation environment. The simulation environment contains 6 ASs with CNs scattered
and 9
ACNs representing different content. We have clients sitting in different ASs
access the content provided, and the overall performance data is recorded for comparison. The simulation topology is shown in Figure 5.3. Although the simulation
scale is small, the characteristic features of the Internet structure are simulated, and
we plan to make a large scale simulation in future work. The local preference of
each route coming into an AS is set to the same default value, and in this special
environment, we set a very small weight (0.1) for both the AS path length and local
preference, while we give a heavy weight to the server side metric.
Figure 5.3: Network topology for the experiment
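The effect of the weighting used in the experiment can be illustrated with a small sketch. The weighted-sum form, the names `weights_t` and `route_cost`, the convention that a smaller result is better, and the server side weight of 0.8 used in the usage note are all assumptions for illustration; the text above only states that the AS path length and local preference weights are 0.1 and that the server side weight is heavy.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical weight set for the three inputs to the preference value. */
typedef struct weights {
    double as_path;     /* weight of the AS path length */
    double local_pref;  /* weight of the local preference */
    double server;      /* weight of the server side metric */
} weights_t;

/* Weighted sum of the three inputs (assumed form; smaller is better). */
double route_cost(const weights_t *w, unsigned as_path_len,
                  unsigned local_pref, unsigned metric_s)
{
    assert(w != NULL);
    return w->as_path * as_path_len
         + w->local_pref * local_pref
         + w->server * metric_s;
}
```

With weights {0.1, 0.1, 0.8} and the same local preference on both routes, a route four AS hops away to a server with metric 20 scores lower than a route one hop away to a server with metric 50, so the server side metric dominates the choice.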
The above topology was configured with the computers in the lab. Computer
names, the AS numbers they belong to, their IP addresses, the role they played in
the network and the content they hold (if they take content server roles) are listed
in Table 5.1.
Some Linux boxes have several network cards installed to act as
routers, and hence have several network addresses. The whole experiment is done
on an internal network using the internal IP subnet 192.168.x.x. We also set up web
servers delegating pseudo domain names, such as,
as shown in Table 5.1.
Evaluation Criteria
In the experiment, we mainly conduct a stress test for performance evaluation.
That is, we simulate multiple clients' concurrent and intensive access to content.
[Table columns: computer name, AS No., IP address(es), role in an AS, content held. Roles listed: content server, CBG, CIG. Content held: lancom, yahoo, Netscape, elle, Google, cnn. CN label: CN-1.]
Table 5.1: The configuration of the testing environment
Client request response time, transactions completed per second, and total number
of completed requests are measured by Openload and recorded for comparison and
analysis. The load variations on the server side are also monitored.
We evaluate the system on the following aspects. First, we expect that if
there is a large quantity of client requests, content that has more replicas will deliver better overall performance than content that has fewer or no replicas, despite
system management overhead. Second, with our load balancing strategy, content
servers should demonstrate good overall load distribution. Third, compared to traditional DNS resolution schemes, our approach incurs one extra
round trip from the CIG to the CBG for non-replicated content request
resolution. However, we think this overhead should be acceptable in contrast to the
significant performance gains for resolving widely replicated content, which is most
frequently requested and accessed. Fourth, our internetworking architecture should
be compatible with the original border gateways, shown by good interoperability.
The experiment results are shown in the next section.
Data Collection and Analysis
We did testing and data analysis on the following aspects:
1) Overhead on the content name resolution:
In this test, we measured the overhead of the content name resolution process
(total for both request and response). The experiment is carried out on a content
route table of 50,000 entries. These entries are randomly generated domain names
with most of them in the .com domain and others in .org and .net.
The overhead we measured on a 667 MHz Pentium III system running Linux
2.4.13 is just 5.7 milliseconds for going through the two hops of the content layer,
the CIG and the CBG. Also, we believe most of this time is spent on packet processing,
as any DNS server has to do.
The DNS delay measured by [32] is shown in Table 5.2. It assesses the delay
by choosing the top 100 URLs with the largest number of Internet users and 100 random
URLs. The 95th percentile delay listed in the table is the delay sufficient to serve
95% of the queries.
[Table columns: DNS delay (ms), Average, Median. Rows: Top 100 URL, 100 Random URL.]
Table 5.2: DNS QoS assessment in the Internet
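For readers who want to reproduce a figure like the 95th percentile delay from raw samples, the following sketch uses the nearest-rank definition, which matches the description "the delay sufficient to serve 95% of the queries". The function name `delay_percentile` and the sample values in the usage note are hypothetical; how [32] computed its percentiles is not stated here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* qsort comparator for doubles, ascending. */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Nearest-rank percentile: the smallest sample such that at least pct
 * percent of all samples are less than or equal to it.
 * Sorts delays_ms in place. pct must be in 1..100 and n at least 1. */
double delay_percentile(double *delays_ms, size_t n, unsigned pct)
{
    assert(delays_ms != NULL && n > 0 && pct >= 1 && pct <= 100);
    qsort(delays_ms, n, sizeof delays_ms[0], cmp_double);
    size_t rank = ((size_t)pct * n + 99) / 100;  /* ceil(pct * n / 100) */
    return delays_ms[rank - 1];
}
```

For ten samples of 10, 20, ..., 100 ms, the rank for the 95th percentile is 10, so the result is the largest sample, 100 ms.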
Compared with the data shown in Table 5.2, we conclude that the overhead
in our approach is negligible.
2) Other performance testing for content accessing:
In the set of tests conducted, we compare how server loads change
over time, the response times when different numbers of surrogate servers
exist, and the overall request response time measured from different clients.
2.1) Server load comparison between two machines with similar configuration:
By having clients in different ASs access continuously for 3
minutes, we observed the changing trend of the server load on robson and celestial
as shown in Figure 5.4, in which server load is calculated in self-defined units. Robson
and celestial are two machines with similar system configurations, a 200MHz CPU,
and 64M and 96M of memory respectively. Server load is computed synthetically from
the CPU load as well as memory and swap space usage. From the chart shown in
the figure, we can see that these two servers have basically the same load during the
service period, which shows good load distribution.
[Chart series: Robson, Celestial]
Figure 5.4: Load comparison on two similarly configured machines
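A synthetic load value of the kind plotted in Figure 5.4 could be computed along the following lines. This is a sketch under assumptions only: the text says the load combines CPU load, memory usage and swap usage in self-defined units, but the weights, the 0 to 100 scale, and the names `load_sample_t` and `server_load` below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sample of server state; each field is a fraction in [0, 1]. */
typedef struct load_sample {
    double cpu;   /* CPU utilization */
    double mem;   /* fraction of physical memory in use */
    double swap;  /* fraction of swap space in use */
} load_sample_t;

/* Combine the three readings into one load value in [0, 100].
 * The weights are illustrative only and sum to 1. */
double server_load(const load_sample_t *s)
{
    assert(s != NULL);
    const double w_cpu = 0.5, w_mem = 0.25, w_swap = 0.25;
    return 100.0 * (w_cpu * s->cpu + w_mem * s->mem + w_swap * s->swap);
}
```

An agent on the content server would sample these readings periodically and report the combined value as the server side metric.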
2.2) Faster average response in case of more surrogates:
We measured the request response times in cases when different numbers of
surrogates exist in the system. For the content, we first start the
web server on robson only, and then simulate 20 clients accessing it concurrently from
both venus and goodearth using OpenLoad. The average response time measured
on venus is shown in Figure 5.5 with the line indicating "one surrogate." We then
start the web service on celestial, and again measure the response time on venus.
The data is shown in the chart with a line indicating "two surrogates". From the
chart, we can see that the average request response time is faster in the case
where more surrogates exist.
2.3) Similar client response times with the same number of surrogates:
The chart in Figure 5.6 shows the response times measured on goodearth and
venus during the same period when they access the content of
From the chart, we can see that the average response time from both clients is quite
similar, which demonstrates the benefits of content internetworking.
3) Interoperability between BGP and CBGP:
To show the feasibility of step-by-step deployment, we also perform interoperability testing of BGP and CBGP using the following simple topology, as shown
in Figure 5.7:
Figure 5.7: Interoperability testing topology
In the testing topology, the Linux box
Robson plays the role of the CIG and
the content server with the agent running on it.
After venus, goodearth and grimface have established connections, we start
the CIG and the agent on the content server. We can see that the route table on
grimface has the entries shown in Table 5.3:
Route to server
www. go
Table 5.3: Content route table in interoperability test
Immediately, the same content route table shown in Table 5.3 appears on
venus too by traversing the non-CBGP area. This proves that CBGP is interoperable with BGP, and that routers upgraded to CBGP can provide customers with
better content service.
Chapter 6
Conclusion and Future Work
Although content networks are being deployed fairly actively on the Internet these
days, content internetworking is still a new and important area. The proprietary nature of most content routing designs makes them undesirable for global use, and there
is still no scalable and efficient way to interconnect content in different networks to
handle increasing global demands for content access. To address the problem, we
propose a hierarchical scalable architecture.
Our approach includes the following features: a hierarchical internetworking
architecture, which lends itself well to good scalability; a hybrid routing platform
integrating both IP and name-based routing, which is compatible with the existing Internet architecture and makes good use of existing network functions; a fully distributed
routing mechanism, which eliminates the bottleneck of a centralized node; direct
participation of the server side metric in the routing process, providing more complete metrics for routing decisions; and load balance considerations in content routing,
which have apparent advantages for load distribution.
Our scheme is the first that we are aware of to use integrated IP and name-based routing mechanisms. When compared to most related solutions, our approach
pushes content naming information out into the network, which greatly reduces
the traffic volume made to the network infrastructure by application layer metric
Preliminary experiments show that our architecture is fully viable, and the
overhead of adding an extra level of indirection for non-replicated content requests
is insignificant compared with the benefit gained.
Our approach can be easily deployed to provide immediate benefits to ISPs
and their customers. It also helps to protect investment in infrastructure by upgrading software only on the border gateway, which should scale at least to the demands
of content requests for popular content.
Future Work
We are considering the following work in the future to further improve and verify
the viability of our approach.
1) Large scale simulation:
While we have conducted some experiments in a size-limited testing environment, we have not done any simulation on a large network yet.
We plan to
use OPNET or NS2 for this simulation, especially to simulate the refined solution
proposed in the design.
2) Further consideration about scalability:
In the design of the framework, we can see that each CBG has to keep the
whole content route table for all the replicated content around the world. This is
not very efficient when the amount grows very large. We can observe that not all
content is popular all over the world. There is content that is only popular in a
certain area or region; although there may be many replicas around that area, it is
not necessary to propagate the content updates to all the CBGs in the Internet. The
popularity of content can be measured through the usage at each network. Proper
methods need to be found to deal with this problem.
3) Agent based content routing metric calculation:
There may exist different ways of calculating the server side metric in each
CN. Even for defining server "load," different brands of hardware and different
types of operating systems react differently under the same set of load criteria.
Newer metrics have been developed recently that test response time
by tracking how quickly packets are responded to, which is the preferable metric in
our proposed solution. Not all the CNs have this metric available, however, so we
want to use mobile agents to coordinate and convert between different metric
measurements, and come up with a mathematical correlation that is the same for all
servers, so that those metrics can be used in a consistent and meaningful way.
[1] M. Day, B. Cain, G. Tomlinson, and P. Rzewski. A Model for Content Internetworking (CDI), February 22, 2002
[2] M. Day, B. Cain, G. Tomlinson, and P. Rzewski. Content Internetworking (CDI) Scenarios. scenarios-00.txt, February 25, 2002
[3] Barbir, B. Cain, F. Douglis, M. Green, M. Hofmann, R. Nair, D. Potter,
http://www.ietf.org/Internet-drafts/draft-ietf-cdi-known-request-routing-00.txt, February 22, 2002
[4] M. Green, B. Cain, G. Tomlinson, S. Thomas, and P. Rzewski. Content Internetworking Architectural Overview.
ietf-cdi-architecture-00.txt, Jun, 2002
[5] B. Cain, O. Spatscheck, M. May, and A. Barbir. Request-Routing Requirements for Content Internetworking, February 22, 2002.
[6] L. Amini, S. Thomas, and O.
cdi-distribution-reqs-00.txt, February 22, 2002.
[7] D. Gilletti, and R. Nair. Content Internetworking (CDI) Authentication, Authorization,
and Accounting Requirements. February 22, 2002.
[8] Y. Rekhter, and T. Li. A Border Gateway Protocol 4 (BGP-4). RFC 1771,
March 1995.
[9] P. Mockapetris. Domain Names - Concepts and Facilities. RFC 1034,
November 1987.
[10] P. Mockapetris. Domain Names - Implementation and Specification. RFC 1035,
November 1987.
[11] M. Gritter, and D. R. Cheriton. A New Next-Generation Internet Architecture,
[12] M. Gritter, and D. R. Cheriton. An Architecture for Content Routing Support
in the Internet,
[13] C. Yang. Efficient support for content-based routing in Web server clusters.
Proceedings of the USENIX Symposium on Internet Technologies and Systems
[14] University of Michigan and Merit Network. Multithreaded Routing Toolkit
(MRT) project,
[15] MRT
[16] MRT User/Configuration
[17] MRT
http://www.merit.net
[18] OpenLoad 0.1.2 for Linux, June 26, 2001.
[19] L. Amini, H. Schulzrinne and A. Lazar. Observations from Router-level Internet
Traces, DIMACS Workshop on Internet and WWW Measurement,
and Modeling, Piscataway, Feb. 2002
[20] Fast Internet Content Delivery with FreeFlow, white paper, Akamai, 2000
[21] D. Katabi, and J. Wroclawski. A Framework for Scalable Global IP-Anycast
(GIA), SIGCOMM'00, Stockholm, Sweden
[22] I. Cooper, I. Melve, and G. Tomlinson. Internet Web Replication and Caching
Taxonomy, RFC 3040, June 2000,
[23] Bassam Halabi, Internet Routing Architectures, 1995, CISCO PRESS
[24] Using the
Protocol for Interdomain
http://
[25] P. S. M. Sayal and P. Vingralek. Selection algorithms for replicated web servers.
In The 1998 SIGMETRICS/Performance Workshop on Internet Server Performance, June 1998
[26] K. Obraczka and F. Silva. Looking at network latency for server proximity. TR
99-714, USC/Information Science Institute, 1999
[27] P. McManus. A passive system for server selection within mirrored resource
from, June 1999
[28] CISCO, http://www.cisco.com/univercd/cc/td/doc/pcat/dd.htm
[29] M. Grossglauser and B. Krishnamurthy. Looking for Science in the Art of Network Measurement, IWDC Workshop, Taormina, Italy, September 2001
[30] K. L. Johnson, J. F. Carr, M. S. Day and M. F. Kaashoek. The measured performance of content distribution networks. In Proceedings of the 5
Web Caching and Content Delivery Workshop, May 2000
[31] L. Qiu, V. N. Padmanabhan, and G. M. Voelker. On the Placement of Web
Server Replicas, Infocom 2001
[32] Internet
http:// /quality_today.html
[33] JUNOS Internet Software Release 5.3,
[34] David M. Nicol, Challenges In Using Simulation to Explain Global Routing
Instabilities, 2002 Conference on Grand Challenges in Simulation, San
TX, January
[35] M. Gan, A Hybrid Hierarchical Request-Routing Architecture for Content Internetworking, M.Sc. Thesis, in progress
[36] T. G. Griffin, and G. Wilfong. An Analysis of BGP Convergence Properties,
SIGCOMM'00, Cambridge, USA