I use this blog as a soap box to preach (ahem... to talk :-) about subjects that interest me.

Thursday, August 26, 2010

Network of Feedbacks on eBay

A network is a collection of nodes connected by links. Railway tracks, roads, blood vessels in our bodies, neurons in our brains, tunnels in ant nests, cables in electricity grids, airways, and sewer and water pipes are all examples of physical networks we are familiar with. The study of physical networks was first formalised during the first half of the 18th century by Leonhard Euler, who invented graph theory. Two hundred years later, to study complex networks and how they were formed, Paul Erdős and Alfréd Rényi studied fictitious networks in which the links were created at random. Random network theory dominated the study of networks for decades, but it had a critical flaw: all nodes had the same number of links. That’s not how most real networks are.



Consider for example the social network supported online by FaceBook: some members have hundred or even thousands of friends, while the vast majority of members only have a handful of contacts. The super-connected nodes are called hubs and, in a sense, represent the glue that keeps the network together. To understand this, consider that if you befriend a super-connected person on FaceBook, her wast network of friends will only be one hop away from you. You will suddenly have access to hundreds and hundreds of people who were unreachable to you only a few minutes before. By bringing many nodes to be only two hops away from each other, the hubs effectively shrink the size of the network.

If you are fascinated by network science as I am, you must read “Linked”, by Albert-László Barabási. It is one of the few books that I bought immediately after reading a copy I had taken from a library.


What I find particularly interesting concerning networks is the concept of small-world networks. In a nutshell, small-world networks are networks in which some long-distance links allow you to reach all the nodes in just a few hops. To understand how it works, consider a network consisting of ten people standing in circle. To give an object to somebody on the opposite side of the circle, you have to give it to your neighbour and ask her to pass it on from neighbour to neighbour until it reaches its destination. This means that the object will have to go through five changes of hand (i.e., five links). On average, to be delivered to a node, the object will have to go through 25/9 = ~2.8 links (clockwise or anticlockwise, from one to five hops away). Now suppose that you have a direct long-distance connection to the fourth person on your left. The average distance of a node goes down to 17/9 = ~1.9 links (see Figure 1, where the numbers of hops to reach all nodes are shown).

Figure 1: A small world

Figure 1 shows that in the “small-world circle” eight nodes have two links, while two nodes (you and the fourth on your left) have three links. With large networks in the real world, if you calculate the number of nodes N as a function of the number of their links L, you find out that N is proportional to L, where γ is a small number. For example, the number of web pages pointed to by a number of links equal to Lin is proportional to Lin-2.1. Similarly, the number of web pages that contain Lout hyperlinks to other pages is proportional to Lout-2.5. These statistical dependencies are called power law distributions, and are well known to physicists, because they pop up in several places. I personally encountered power laws when I worked on a radiative model of a supernova remnant for my thesis in Physics.

I wanted to study a small-world network, but I needed a network with enough data for a statistical analysis and to which I had full access. I found one in eBay’s feedback pages. If you go to:


you see the first of my feedback pages (&userid=bloody-me&page=1), containing up to 200 feedbacks (&items=200). Each item contains either the member ID of a seller or that of a buyer. Therefore, by following those IDs, you can crawl the whole feedback tree. I wrote a JSP program that did just that. It requested my page of feedback and extracted from it the IDs of the people who left for me a feedback. Then, it loaded their feedback pages and did the same with them. The program kept loading pages and following IDs for as long as I wanted. While crawling through the links, the program counted them, thereby providing the data I needed to see whether the distribution of IDs with respect to their number of links followed a power law. To avoid artificial distortions of the distribution, I took care of removing duplicates, which occurred both because some people bought more than one item from the same seller and because the program saw each link twice: from the buyer side and from the seller side. 

I stopped the iteration after one thousand IDs (actually, in the end, they were 1004), and the program counted a total of more than three million unique feedback items (i.e., links). The results were as follows



Links range
#ID
1 to 100
306
101 to 500
381
501 to 2500
186
2501 to 12500
84
12501 to 62500
34
62501 to 312500
13

Notice that, in order to have statistically significant counts of IDs, I compensated for the greater rarity of larger hubs by increasing the range of link numbers. Due to the increased range, though, I had to correct the number of IDs before being able to plot them. I did so by dividing the number of IDs by the range width and then multiplying the result by 100. The resulting table was as follows:



Midrange Value
Corrected #ID
50.5
309.091
300.5
95.489
1500.5
9.305
7500.5
0.840
37500.5
0.068
187500.5
0.005

Finally, I was able to plot the results with logarithmic scales on both axes. Power laws appear as straight lines in bi-logarithmic plots and are therefore easy to detect at a glance.


The crosses indicate the values show in the table. As we are dealing with statistical data, we cannot be sure that if we repeated the measure the crosses would be in exactly the same position. The error bars correspond to the 3σ statistical uncertainty. That is, they show the interval that contains the asymptotic value (i.e., resulting from an infinitely large sample) with a probability of 99.73%. In any case, error bars or not, the crosses seem to be very well aligned, with the exception of the first one.

First of all, I repeated the plot without the first point, and used a spreadsheet to interpolate a straight line. Here is what I got:


This time, I also added horizontal error bars, although in fact they turned out not to play any role. The five points are aligned on a power law with γ = -1.5260. The red and blue lines are the 3σ boundaries. They tell us that γ most likely is between -1.6389 and -1.4055 although, as you can see, the best-fit line is straight through the crosses, which means that the actual value of γ is not likely to depart significantly from the best-fit value of -1.5260. The coefficient of determination R2 is very close to one, thereby confirming that we wouldn’t expect large variations if we repeated the measure.

Why is the meaning of a γ of -1.5260? I have no idea. It is “sort-of half the way” between -1 and -2. -1 would mean that a node with twice as many links as another node would be twice as rare. -2 would mean that a node with twice as many links as another node would be four times as rare. That is, that big hubs would be much more rare. -1.5260 is somewhere in between...

A question needs to be answered: why is the first point not aligned with the other five. Or: why are the nodes with not more than 100 links less frequent than what could be expected when looking at the distribution of larger nodes? N(50.5) = 628136 * 50.5-1.5260 = ~1581, while the program only counted 309 nodes.

The following plot shows the number of nodes with up to 500 links. The red line represents the power law interpolated from the previous plot.

As you can see, there is a gradual departure from the power law distribution. Please note that the placement of the trend on this plot is somewhat arbitrary: I just forced the power law to pass through the point with 500 links. Therefore, the fact that the curve seems to depart from the trend line at 200 rather than 100 links is not significant. To eliminate the influence of this effect on the calculation of γ I might have to remove the counts for up to 200 links instead of 100, but it doesn’t seem to be worth the effort.

Let me formulate the issue in one more way: why are there fewer than expected feedbacks associated with eBay customers that have comparatively few feedbacks? Perhaps, the eBay users with hundreds of transactions recognise the value of feedbacks and make a more systematic use of them. My statistics refer to feedbacks received, but the consideration could still apply. I wouldn’t be surprised if the eBay stores had a policy of returning a feedback only to the buyers who give a feedback to them. Then, if people who buy from eBay stores but not often tended not to leave feedbacks, they would receive few feedback as well, thereby “depressing” the curve.

Another possible factor could be that I have considered all feedbacks, both those received as buyers and those received as sellers. It could be that if I only considered the feedback received for selling, the power law would apply to lower numbers of feedbacks. Or perhaps not... In any case, it took me several days to collect the data in small batches and, I confess, I have at the moment no fun to do it all over again to check out the sellers...

It is an interesting problem, though...

No comments:

Post a Comment