I use this blog as a soap box to preach (ahem... to talk :-) about subjects that interest me.

Wednesday, April 18, 2012

A real small-world network

I am fascinated by small-world networks and, for once, I would like to analyse a real network of which I know the complete structure.  This would make possible for me to calculate its average path length and its clustering.

After thinking about it for a while, given the fact that I am interested in civil aircrafts, I though it would be nice to study the global air-route network.

Years ago, I had a list of all air sectors in existence.  It was in one of those telephone-book-sized manuals marked "ABC" and used by travel agents.  But I must have thrown it away before one of my many moves, and I suspect that nowadays no travel agent has even ever seen such a book.

I couldn't find anything useful on the IATA (International Air Travel Association) and ICAO (International Civil Aviation Organization) websites.

When I searched Google for "nonstop flights XXX", with XXX set to a IATA airport code (e.g., JFK for New York's Kennedy) it showed a list of all cities with non-stop flights to that airport.

"That's what I need!" I thought.  "I only need to download from somewhere a list of all airport codes and write a little script that interrogates Google for all of them.  Then, I can convert the city names to IATA codes to have a list of all airport pairs.  I'll have to merge the lists of airports located near the same city (e.g., FCO for Rome Fiumicino and CIA for Rome Ciampino), but I can do that.  Once I have all city pairs, I can build my network.

It sounded reasonable, but there was a catch: Google makes it very difficult to be able to execute a query programmatically.  So much so that I gave up.

I had to find another network.

Fortunately, I discovered the site http://nsflight.com.  It lists all pairs of US airports that are connected non-stop.  As it turned out, to limit the study to the US was a blessing in disguise, because the analysis of the air routes requires an immense amount of computing time.

On April 13th (a black Friday), I finally had my list of US cities with an airport (699) and a list of directly-connected city pairs.  That's when I discovered that it would take days and days to get a result.

What I needed to build my network was a program to merge non-stop pairs into one-stop pairs, then to two-stop pairs, and so on until all 699 nodes had been connected to each other.  That is an enormous number: 699 x 698 / 2 = 243,951.  Certainly, too big to be kept in memory.  I had to build a database with the pairs.  Here is the SQL script I used:
create database pairs;
create table pairs.pairs (
  one character(3),
  two character(3),
  len integer
  );
create index one_key on pairs.pairs (one, len);
create index two_key on pairs.pairs (two, len);
create index len_key on pairs.pairs (len);
create unique index pair_key on pairs.pairs (one, two);

As you can see, each record consists of a pair of three-letter city codes and a number, which represents the minimum number of intermediate stops (i.e., the minimum path length between the pair of nodes).

The fourth index associates to the pair of city codes a unique key.  This guarantees that each pair can only appear once in the database.  I decided that the nodes of each pair would always appear in alphabetical order.  This avoided useless duplications that I would have had to remove later.

After writing an SQL script to insert the 1-pairs (i.e., the pairs with a single link between them), I wrote a JSP script to discover all possible combinations of pairs "back to back" and insert them as 2-pairs.

The JSP script went through the hard-coded list of city-codes one by one and queried the database to obtain all pairs that had the current city-code in either position and the len column set to 1.  This returns a list of all pairs containing the current city-code.  Then, the scripts repeats the query for each one of the other city-code of each pair, which returns another list of pairs.

For example, "GAM" (Gambell AK) is directly connected with five other cities (Elim, Nome, Shaktoolik, Savoonga, and White Mountain, all in Alaska).  The first query in the script returns this list:
ELI GAM
GAM OME
GAM SKK
GAM SVA
GAM WMO

The script executes the second database query for all five.  For example, the query for Shaktoolik returns the following list:
ELI SKK
GAM SKK
KKA SKK
OME SKK
SKK SMK
SKK UNK

The script then combines the 1-pairs obtained with the second query with the original 1-pair GAM-SKK to form six tentative 2-pairs:
GAM SKK ELI
GAM SKK GAM
GAM SKK KKA
GAM SKK OME
GAM SKK SMK
GAM SKK UNK

While going through the 2-pairs, the script discards GAM-SKK-GAM because the two ends are identical, but tries to insert into the database the other five 2-pairs with len set to 2. The first one fails because GAM-ELI is already present with len = 1, but the other four succeed.

Once obtained all 2-pairs, I launched the script again but with the len parameter of the second query set to 2.  This gave me all 3-pairs, but I had to go through all cities in three chunks, because the computer ran out of memory.  Two factors contributed to this debacle: firstly, I had to launch the script from a web browser, which gobbled up quite a bit of memory; secondly, the script was executed within the Tomcat Java server, which also used up plenty of memory and was known in the past to have memory leaks.  Even without leaks, it would have probably failed.

At this point, I converted the JSP script into a Java application, and the Windows Task Manager told me that the usage of physical memory remained around 40%.

At the moment of writing, I am running the application with 1-pairs and 4-pairs to obtain 5-pairs.  It has been running for hours, and I expect it to run overnight.  The CPU usage remains well below 15%.  This means that the time goes into database access.  In fact, I can hear that the hard disk is working hard. I don't see how I could speed that up.  Perhaps I could create a virtual disk in memory and use that to store the database.  I am open to suggestions...

The total number of 1-, 2-, 3-, and 4-pairs is 195,550.  The program will still have to run for a while longer to reach the longest of the shortest paths between pairs of nodes.

Once I will have the list of all pairs, I will be able to calculate their average length.  I'll keep you posted.

No comments:

Post a Comment