A cheap way to generate a graph of the relations between a web site's pages…


Here is a Ruby script that generates the tree of the pages of a web site in OPML, XOXO and DOT formats.
It also saves its work in YAML format, so you don't need to scan the same site several times if you just want to produce another representation of it. Feel free to adapt it to your needs.
By feeding the DOT output to GraphViz, you can generate any graphic you like.
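
To give an idea of what the DOT output amounts to, here is a minimal sketch that turns a page tree into a GraphViz digraph. The nested-hash layout, the SITE constant and the dot_edges / to_dot helpers are assumptions made up for this illustration; the script's real data structure and output may well differ.

```ruby
# Hypothetical page tree: each key is a URL, each value is the hash of
# pages reached from it. This layout is an assumption for illustration,
# not the script's actual internal structure.
SITE = {
  "http://example.com/" => {
    "http://example.com/about" => {},
    "http://example.com/blog"  => {
      "http://example.com/blog/post-1" => {}
    }
  }
}

# Recursively collect one DOT edge line per parent/child pair.
def dot_edges(tree, parent = nil, edges = [])
  tree.each do |url, children|
    edges << %(  "#{parent}" -> "#{url}";) if parent
    dot_edges(children, url, edges)
  end
  edges
end

# Wrap the edges in a digraph declaration that GraphViz can read.
def to_dot(tree, title = "site")
  (["digraph \"#{title}\" {"] + dot_edges(tree) + ["}"]).join("\n")
end

puts to_dot(SITE)
```

Saved to a file such as site.dot, the result can be rendered with GraphViz, for example with: dot -Tpng site.dot -o site.png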

Usage:

Usage: httptree.rb [options] (url ... | -i file)
-d, --debug turn on debugging
-p, --pools [INTEGER] number of threads to use
-v, --verbose turn on verbose mode
-t, --title [STRING] document title
-n, --ndeep [INTEGER] number of levels to walk thru URLs (default: 999999)
-r, --nredirection [INTEGER] max number of redirections to follow when fetching URLs (default: 10)
-f, --format [STRING] output format [OPML | XOXO | DOT] (default OPML)
-o, --output [STRING] output data structure to YAML format
-i, --input [STRING] input file in YAML format from previous execution with -o option
-c, --colors [STRING] string is "mime=color;..." or @filename (YAML format)
-C, --dumpcolors [STRING] dump colors to filename (YAML format)
-e, --excluded [STRING] string is "extension=mime;..." or @filename (YAML format)
-E, --dumpextenstions [STRING] dump excluded extensions to filename (YAML format)
-S, --separate outputs one DOT layer per depth level found (useful only if option -f DOT is also selected)
-D, --dumpdefaults [STRING] dump defaults to filename and quit (any option before -D is taken into account)
-I, --usedefaults [STRING] read defaults from file
-h, --help show help and quit
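
The -c and -e options take either an inline "key=value;..." string or an @filename pointing to a YAML file. A rough sketch of how such a value could be turned into a Hash is shown below. The parse_mapping helper is hypothetical and the color names are only sample data; the script's own parsing may be different.

```ruby
require 'yaml'

# Hypothetical helper: parse an option value of the form
# "key=value;key=value" into a Hash. If the value starts with "@",
# the rest is taken as the name of a YAML file holding the mapping.
def parse_mapping(arg)
  return YAML.load_file(arg[1..-1]) if arg.start_with?("@")

  arg.split(";").each_with_object({}) do |pair, map|
    key, value = pair.split("=", 2)
    next if key.nil? || value.nil? # skip empty or malformed pairs
    map[key.strip] = value.strip
  end
end

# Sample data only: these MIME types and colors are not the script's defaults.
colors = parse_mapping("text/html=lightblue;image/png=green")
puts colors.inspect
```

With this approach, a value such as "@colors.yml" would read the same kind of mapping from a YAML file, which matches the spirit of the -C and -E dump options.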

List of requires:

require 'net/http'
require 'thread'
require 'optparse'
require 'rexml/document'
require 'rexml/streamlistener'
require 'set'
require 'rubygems'
require 'rubyful_soup'
require 'logger'
require 'yaml'