By Chris Sherman Posted On May 22, 2000
A new map of cyberspace shows that the Web resembles a bow tie, with divisive boundaries that can make navigation between regions difficult or even impossible, according to a new study published by researchers at AltaVista, Compaq, and IBM. Previous theories suggested that the Web was highly connected, with no more than 19 degrees of separation from any one site to another. By contrast, the new map reveals a subtler structure that may lead to more efficient search engine crawling techniques and a greater understanding of the sociology of content creation, and that may help predict the emergence of new phenomena on the Web such as Web rings and spam clusters.
"We realized that something is going on, that the model where everyone is linked to everyone else really doesn't quite work," said Andrei Broder, AltaVista's vice president of research and lead author of the study. "The old picture, where no matter where you start, very soon you'll reach the entire Web, is not quite right."
The study analyzed the connectivity of more than 1.5 billion links on over 200 million Web pages. Links are the glue that connects pages on the Web, and they're vital for navigation. People rely on them to browse the Web, and search engines rely on them to locate pages to add to their indexes. By creating a map showing all links pointing to and from this large sample of pages, the researchers aimed to create a comprehensive topography of the Web.
They found some surprising statistics, according to Broder. Unlike previous models portraying the Web as clusters of sites forming a well-connected sphere, the results showed that the Web's structure more closely resembles a bow tie consisting of three major regions (a knot and two bows), and a fourth, smaller region of pages that are "disconnected" from the basic bow-tie structure.
At the center of the bow tie is the knot, which the study calls the "strongly connected core." This core is the heart of the Web. Pages within the core are strongly connected by extensive cross-linking among themselves. Links on core pages enable Web surfers to travel relatively easily from page to page within the core. They are also the links most likely to be followed by search engine indexing spiders.
The left bow consists of "origination" pages that eventually allow users to reach the core, but that can't themselves be reached from the core. Origination pages are typically new or obscure Web pages that haven't yet attracted interest from the Web community (they have no links pointing to them from pages within the core), or that are linked to only from other origination pages. Relatively closed community sites such as GeoCities and Tripod are rife with origination pages that often link to one another but are seldom linked to from pages within the core.
The right bow consists of "termination" pages that can be accessed via links from the core but that do not link back into the core. "Many commercial sites don't point anywhere except to themselves," said Broder. Instead, corporate sites exist to provide information, sell goods or services, and otherwise serve as destinations in themselves, and there is little motivation to have them link back to the core of the Web.
The final region contains "disconnected" pages that aren't part of the bow tie. These pages can be connected to origination and/or termination pages but are not directly accessible to or from the connected core.
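The four regions described above can be illustrated on a tiny link graph. The following is only a minimal sketch, not the researchers' method: it assumes we already know one page that sits in the core (the hypothetical `core_page` argument), and the page names in `TOY_LINKS` are invented for illustration. A page is in the core if the seed can reach it and it can reach the seed; origination pages can reach the seed only; termination pages can only be reached from it; everything else is disconnected.

```python
from collections import deque


def reachable(start, adjacency):
    """Return the set of pages reachable from `start` by following `adjacency`."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in adjacency.get(page, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


def classify_bow_tie(links, core_page):
    """Label each page as core, origination, termination, or disconnected.

    `links` is a list of (source, target) pairs. `core_page` is a page that
    is assumed (not verified) to belong to the strongly connected core.
    """
    forward, backward, pages = {}, {}, {core_page}
    for source, target in links:
        forward.setdefault(source, []).append(target)    # links pointing out
        backward.setdefault(target, []).append(source)   # links pointing in
        pages.update((source, target))

    downstream = reachable(core_page, forward)   # pages the core can reach
    upstream = reachable(core_page, backward)    # pages that can reach the core

    regions = {}
    for page in pages:
        if page in downstream and page in upstream:
            regions[page] = "core"
        elif page in upstream:
            regions[page] = "origination"
        elif page in downstream:
            regions[page] = "termination"
        else:
            regions[page] = "disconnected"
    return regions


# Hypothetical toy Web: page names are made up for this example.
TOY_LINKS = [
    ("a", "b"), ("b", "c"), ("c", "a"),   # a, b, c link in a cycle
    ("new", "a"),                         # links into the cycle, no link back
    ("c", "shop"), ("shop", "cart"),      # reached from the cycle, no link back
    ("new", "orphan"),                    # hangs off an origination page
]
REGIONS = classify_bow_tie(TOY_LINKS, core_page="a")
```

Here `orphan` ends up in the disconnected region even though an origination page links to it, which matches the description of disconnected pages above. On a real crawl the choice of seed page matters: a seed outside the core would produce a different, misleading partition.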
Perhaps the most surprising finding is the size of each region. Intuitively, one would expect the core to be the largest component of the Web. It is, but it makes up only about one-third of the total. Origination and termination pages each make up about a quarter of the Web, and disconnected pages about one-fifth.
Broder believes the core is growing faster than the other regions because Web authors are increasingly aware of the importance search engines place on link analysis in calculating relevance. "People try to get links from the core. If you find someone in the core to point to you, and you point back, you're in the core," said Broder, wryly noting that he soon expects an increase in dubious services offering Web authors who are hungering for visibility the ability to "connect yourselves to the core of the Web for $29.95."
What about the explosion in e-commerce and B2B sites? Won't the proliferation of commercial sites swell the population of termination pages? "Commercial sites grow very fast, but don't grow so fast in terms of pages we want to crawl," said Broder. Many are catalog pages and of little value in a general purpose search engine, he said.
Other notable findings from the study include the following:
For any randomly chosen source and destination page, the probability that a directed hyperlink path (one that follows links forward only) exists from the source to the destination is only 24 percent.
If a directed path does exist between randomly chosen pages, its average length is 16 links. In other words, a Web surfer would have to click links on 16 pages to get from random page A to random page B. This figure is lower than the 19 degrees of separation postulated in a previous study, but it excludes the 76 percent of page pairs that have no directed path at all.
If an undirected path exists (meaning links can be followed forward or backward, a technique available to search engine spiders but not to a person using a Web browser), its average length is about 6 links.
More than 90 percent of all pages on the Web are reachable from one another by following either forward or backward links. This is good news for search engines attempting to create comprehensive indexes of the Web.
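The difference between directed and undirected paths in the findings above can be made concrete with a small sketch. This is an illustration under stated assumptions, not the study's procedure: it runs a breadth-first search from every page, which is practical only for tiny graphs, and the helper names (`build_adjacency`, `path_statistics`) and the toy links are hypothetical.

```python
from collections import deque
from itertools import permutations


def build_adjacency(links, undirected=False):
    """Map each page to the set of pages one hop away.

    With undirected=True every link can also be followed backward.
    """
    adjacency = {}
    for source, target in links:
        adjacency.setdefault(source, set()).add(target)
        if undirected:
            adjacency.setdefault(target, set()).add(source)
    return adjacency


def shortest_path_lengths(start, adjacency):
    """Breadth-first search: fewest links from `start` to each reachable page."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in adjacency.get(page, ()):
            if target not in distance:
                distance[target] = distance[page] + 1
                queue.append(target)
    return distance


def path_statistics(pages, adjacency):
    """Return (fraction of ordered page pairs joined by a path, average length).

    The average is taken only over pairs that are actually connected.
    """
    distances = {page: shortest_path_lengths(page, adjacency) for page in pages}
    lengths = [
        distances[src][dst]
        for src, dst in permutations(pages, 2)
        if dst in distances[src]
    ]
    total_pairs = len(pages) * (len(pages) - 1)
    average = sum(lengths) / len(lengths) if lengths else 0.0
    return len(lengths) / total_pairs, average


# Hypothetical toy Web: page names are made up for this example.
TOY_LINKS = [
    ("a", "b"), ("b", "c"), ("c", "a"),
    ("new", "a"), ("c", "shop"), ("shop", "cart"), ("new", "orphan"),
]
PAGES = sorted({page for link in TOY_LINKS for page in link})

DIRECTED = path_statistics(PAGES, build_adjacency(TOY_LINKS))
UNDIRECTED = path_statistics(PAGES, build_adjacency(TOY_LINKS, undirected=True))
```

On this toy graph, 19 of the 42 ordered page pairs are joined by a directed path, while every pair is joined once links may be followed in either direction. The averages are computed over different sets of pairs in the two cases, so they are not directly comparable, which is the same caveat that applies to comparing the study's 16-link figure with the earlier 19-link estimate.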
The bow-tie map offers a richer, more detailed model for understanding the Web than we've had in the past. Broder hopes to continue the research. "We still have to do more analysis, and also build models to understand how the Web evolves," he said. "If nothing else, we'll try to see if the structure is maintained."
Whether the Web maintains its bow-tie structure, or morphs into something completely different, will be fascinating to observe. After all, the Web has just grown by about 50 pages and 500 links during the time you took to read this single sentence. Whether this new growth conforms to the bow-tie model, or extends the Web, amoeba-like, into another structure that we can only guess at, remains to be seen.
Chris Sherman is president of Searchwise, a Boulder, Colorado-based Web consulting firm, and associate editor of Search Engine Watch. Email: csherman@searchwise.com