Web Exploration & Cartography





Introduction


This article aims to be a resource for the curious venturer and explorer of the World Wide Web. It seeks to provide this individual with a wealth of resources left by those who have preceded them. A multitude of reasons may exist for exploration, since it can be both a method and an end in itself. It is, however, always driven by a desire for knowledge and understanding. Because of that, a central goal for the explorer is to archive and map the various impressions he or she encounters throughout his or her travels. The explorer seeks to do so in a fashion which renders one's findings clear and easily conceivable to others. There are multiple ways of exploring the web. The most common and accessible is sporadic drifting on the tides of what the browser offers: the explorer surfs from link to link, waiting for a different current that might carry him towards new links and new directions, thereby steering him onto a potentially indefinite and largely unforeseen path of exploration.

It is important to note that exploration differs from search. In search, the object of desire is already known and needs only to be sought out. Explorers and cartographers, on the other hand, seek to look at systems and territories in themselves. Exploration, as opposed to search, inherently involves a multitude of uncertainties and risks: risks which the traveller has to delicately steer or navigate through, possibly leading to a mortal stumbling or tumbling along the edge of the blade.

One example of such a risk is the treacherous nature of the search engine. For when this Siren sings, it provides only results that lie within its victim's line of expectation and desire. Entangled by its call, the explorer finds himself wound up in self-affirming echoes, unable to uncover truly unknown waters.

The web appears to every beholder in a different guise. It follows that every explorer will be able to detail only his personal voyage and will draw only his personal charts. For that reason this guide aims not to provide previously drawn maps and routes. Rather, it seeks to equip future explorers with tools, methods and examples that make it possible for aspiring navigators to cross the normative boundaries of their own web environment: a set of tools for the surfer to stop drifting, tame the winds and navigate the open seas.



Tools
traceroute

Sea Spiders

A sea spider, or so-called web crawler, is a more or less autonomous pathfinder. Following certain predefined rules, it is set loose to explore the web on its own devices, reporting back at certain intervals to the instance who put it to sea.

Link to a python programmed sea spider

Link to the whole pack

Rather than mapping the route being followed, this specific specimen is engineered to continuously report back in a travel-journalistic style, recording date and time, duration and destination. Additionally, the crawler is programmed to download all images encountered at each destination. A travel journal is generated for every destination visited, each one composed on a sheet of A5 paper.
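The exact journal layout of the crawler is not reproduced here, but the fields it records (date, time, duration, destination, images) can be sketched in Python roughly as follows. The field names and formatting are an assumption for illustration, not the authors' actual format.

```python
import datetime

def journal_entry(url, duration_seconds, image_count):
    """Compose one hypothetical travel-journal entry for a visited destination."""
    now = datetime.datetime.now()
    return (
        f"Date: {now:%Y-%m-%d}\n"
        f"Time: {now:%H:%M}\n"
        f"Duration: {duration_seconds}s\n"
        f"Destination: {url}\n"
        f"Images saved: {image_count}\n"
    )

print(journal_entry("http://example.com/", 4, 7))
```

Each such entry would then be laid out on its own A5 sheet.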



On a technical level, the crawler requires a 'link of departure'. The HTML structure of the link of departure is processed, or parsed, to return a list of the new links available within it. At this point the script picks one of these links as a new 'link of departure' and proceeds by performing the same parsing on the corresponding HTML structure. This allows the web crawler to perform its infamous 'crawl': routing itself through a possibly endless network of links, departing from one link, claiming the next for a moment, before jumping on to yet another.
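The parse-and-choose step described above can be sketched with Python's standard library alone. This is a minimal illustration of the technique, not the linked script itself: it collects every anchor href from a page and picks one at random as the next link of departure.

```python
import random
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect every href found in the anchor tags of an HTML document."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def parse_links(html):
    """Parse an HTML structure and return the list of links available within."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links

def crawl_step(html):
    """One 'crawl': parse the current page, pick the next link of departure."""
    links = parse_links(html)
    return random.choice(links) if links else None

page = '<a href="http://a.example/">a</a> <a href="http://b.example/">b</a>'
print(crawl_step(page))  # one of the two links, chosen at random
```

Fetching the page behind the chosen link and repeating this step yields the endless crawl.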



Certain limits have been put on the crawler, allowing it a very extended leash:

a) Choosing a link

As described above, the script continuously has to choose a new link of departure, deliberately not parsing every single available link. This was decided because processing each link together with its children, grandchildren, great-grandchildren and so on would amount to an unforeseeable accumulation of links to process. Please refer to the animation above. The manner of choosing a link has been left to the mysteries of greater algorithms: random.
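The accumulation argument is easy to make concrete. Assuming, for the sake of illustration, that every page exposes roughly the same number of links, following all of them grows exponentially with depth, while following one random link grows only linearly:

```python
# Illustrative figures, not measured values: k links per page, depth d.
k, d = 20, 5

exhaustive = k ** d  # pages touched when every link and its descendants are followed
random_walk = d      # pages touched when one random link is followed per step

print(exhaustive)  # 3200000
print(random_walk)  # 5
```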

b) Rooting out

It has been important to streamline the variable lists of links into formats which are parseable by the crawler. This helps clear away some unnecessary obstacles on the future trajectory. Concretely, this means rooting out links which point to images (links ending in .jpg, .png or .gif) and ignoring links referring to CSS sheets, JavaScript files, etc.
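A rooting-out filter of this kind can be sketched as a simple suffix check. The exact list of ignored endings is an assumption based on the ones named above:

```python
# Endings the crawler cannot meaningfully parse as HTML (assumed list).
IGNORED_ENDINGS = (".jpg", ".png", ".gif", ".css", ".js")

def root_out(links):
    """Drop image, stylesheet and script links from a list of candidates."""
    return [link for link in links if not link.lower().endswith(IGNORED_ENDINGS)]

links = ["http://example.com/page.html",
         "http://example.com/photo.JPG",
         "http://example.com/style.css"]
print(root_out(links))  # ['http://example.com/page.html']
```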

c) Environmental interest

If interest lies in a specific environment of the web, it is necessary to customise the crawler so that it will not leave this area of interest, for example by forcing the crawler to stay inside a certain domain or platform. The script is then not asked to ignore specific link endings as in b), but instead to list only links which include certain patterns.
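One way to keep the crawler inside a domain of interest, sketched here as an assumption about how such a restriction might look, is to compare each link's host against the chosen domain:

```python
from urllib.parse import urlparse

def keep_inside(links, domain):
    """Keep only links whose host lies within the given domain of interest."""
    kept = []
    for link in links:
        host = urlparse(link).netloc
        if host == domain or host.endswith("." + domain):
            kept.append(link)
    return kept

links = ["http://someblog.tumblr.com/post/1", "http://example.com/elsewhere"]
print(keep_inside(links, "tumblr.com"))  # ['http://someblog.tumblr.com/post/1']
```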

d) Avoiding icebergs

During development and testing of the latest script version, several fatal encounters with malicious icebergs ended the fragile life of the sea spider. Amongst the icy giants were IOError, ValueError and UnicodeError. To avoid this abrupt ending, a safety feature has been built into the script: instead of plunging into obliteration, the spider first pokes the glacier to see if it is passable. If not, the spider returns to the previous destination and tries another of the available routes.
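The poke can be sketched as a guarded fetch that catches exactly the icebergs named above and reports whether the route is passable, leaving the backtracking to the caller. This is a minimal illustration, not the script's actual safety feature:

```python
import urllib.request
from urllib.error import URLError

def poke(url, timeout=5):
    """Probe a destination before committing to it; True if it is passable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.getcode() == 200
    except (URLError, ValueError, UnicodeError, IOError):
        # The icy giants: back off so the caller can try another route.
        return False
```

A crawler would call `poke` on the chosen link and, on `False`, return to the previous destination and choose again.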

e) Never the same

To make a true explorer out of the crawler, a feature was added forcing the script to always make a new choice of link. This means that the crawler never visits an identical link twice, and therefore always sets sail for unknown waters. Practically, this feature appends every link, once it has been visited, to the list of items to be rooted out in b).
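Combined with the random choice from a), this never-the-same rule amounts to choosing only among unvisited links and recording the choice. A minimal sketch, with the visited collection kept as a set rather than the script's root-out list:

```python
import random

def choose_unvisited(links, visited):
    """Pick a new link of departure, never one already travelled."""
    fresh = [link for link in links if link not in visited]
    if not fresh:
        return None  # dead end: the caller should backtrack
    choice = random.choice(fresh)
    visited.add(choice)  # the equivalent of appending to the root-out list in b)
    return choice

visited = {"http://a.example/"}
print(choose_unvisited(["http://a.example/", "http://b.example/"], visited))
```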

Repositories
Possible databases to explore?

Experiments
Writers of the article go hands-on.

TumblrJumpr - deep probing an internet phenomenon
-=II External Project-Site II=-

Dotcom Index
Visit sites you would never visit otherwise using the Dotcom Index. This script generates all possible URL combinations using the alphabet, the digits 0-9 and the hyphen.
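The Dotcom Index script itself is not shown here, but the combinatorial idea can be sketched with `itertools.product` over the 37-character alphabet it describes. Note that, as in the description, this enumerates every combination, including names (such as those starting or ending with a hyphen) that are not valid registrable domains:

```python
import itertools
import string

ALPHABET = string.ascii_lowercase + string.digits + "-"  # 37 characters

def dotcom_index(length):
    """Yield every .com URL whose name has the given length."""
    for combo in itertools.product(ALPHABET, repeat=length):
        yield "http://" + "".join(combo) + ".com"

urls = list(dotcom_index(2))
print(len(urls))   # 1369, i.e. 37 ** 2 combinations
print(urls[0])     # http://aa.com
```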

Field studies
Works by others to explore and document the web.

One Terabyte of Kilobyte Age

Internet Census 2012

The Wayback Machine

Spam
http://mac.sixfiles.com/dbase/screens/vostrom-holdings--inc.-layer-four-traceroute.jpeg layer four traceroute
