Tuesday 2 July 2013

Find the Median of a Large Set of Number

Task:
There are 10G disordered numbers, you need to find out the median. You have 2G bytes memory.

Solution:

The simplest method is sorting all numbers then find the middle item:

1  2  3  … 49  50  51  … 98  99  100

There are two defects:

1. Not necessary to sort all numbers
2. Occupies too much memory

Improved:

Create a data table which count the numbers in separated ranges:
(After first scan)

Range 1-20 21-40 41-60 61-80 81-100
Count 20 20 20 20 20

The median must in range [41-60] while the offset is 10 which means the 10th item in the range block is the median

Next, program will scan data set again, and only copy 20 ranged numbers into memory. Do a sort and grab the 10th item.

=======

Assume the numbers are unsigned 64bit integer. So each number occupies 8 bytes. 2G ram could handle no more than 256M numbers each time. The 10G number set could be separated into 40 groups.

Then, define several ranges from 0 to 2^64 -1, looks like arithmetic sequence:

( Pseudo code )

int64 a = 2^64 / 4096;

int64[4096] b ;

Array b statistics the volume of numbers hit in the range. In current code, each b in the array maps a range of 2^64 / 4096 = 2^52

For instance, if there is a number 68846541654532100.

68846541654532100 / (2^52) = 15.287

will lead to:

b[15]++;

 

The program do a initial process, read 256M numbers each time, for 40 main loops. In each loop, compare function will find out the range of each number and update the b[4096]. When loops done, we will get a full record b[4096].

Find the Median range:

int64 sum=0;
int64 prev=0;
int i;
for (i=0 ; i < 4096 ; i++)
{
  prev=sum;
  sum += b[i];
  if (sum >= 2^64 / 2)
    break;
}
int64 offset=(2^64 /2 – prev);
print( i );

Median must located in:
from  i*a  to  (i+1)*a-1
and must be the offset=(2^64 /2 – prev) item in the target block.

Program will scan 10G number set again, but only pick the numbers in the specific range and write them to file system.

IF the final file is less than 2G, sort the content, and pick the ‘offset’ item.

IF the final file is greater than 2G, separate it again as previous steps but smaller range. Do not forget to translate the offset.