Perl Weekly Challenge 132: Hash Join

by Abigail

Challenge

Write a script to implement Hash Join algorithm as suggested by wikipedia.

1. For each tuple r in the build input R
   1.1 Add r to the in-memory hash table
   1.2 If the size of the hash table equals the maximum in-memory size:
       1.2.1 Scan the probe input S, and add matching join tuples to the
             output relation
       1.2.2 Reset the hash table, and continue scanning the build input R
2. Do a final scan of the probe input S and add the resulting join tuples
   to the output relation

Example

Input:

@player_ages = (
        [20, "Alex"  ],
        [28, "Joe"   ],
        [38, "Mike"  ],
        [18, "Alex"  ],
        [25, "David" ],
        [18, "Simon" ],
    );

@player_names = (
        ["Alex", "Stewart"],
        ["Joe",  "Root"   ],
        ["Mike", "Gatting"],
        ["Joe",  "Blog"   ],
        ["Alex", "Jones"  ],
        ["Simon","Duane"  ],
    );

Output:

Based on index = 1 of @players_age and index = 0 of @players_name.

20, "Alex",  "Stewart"
20, "Alex",  "Jones"
18, "Alex",  "Stewart"
18, "Alex",  "Jones"
28, "Joe",   "Root"
28, "Joe",   "Blog"
38, "Mike",  "Gatting"
18, "Simon", "Duane"

Discussion

Mapping instructions for a relation database to a language like Perl is never easy. But these instructions are particular tricky, mostly because of this:

If the size of the hash table equals the maximum in-memory size

In Perl, when you reach the maximum in-memory size, you have consumed all the memory the OS is willing to give you. Perls solution to a request for more memory if the OS isn't willing to give you more is to crash (which cannot be trapped). You will not have any memory left to do something useful.

Note that dealing with running out of memory is part of the challenge. We're not asked to implement the standard hash join algorithm. Right above the quoted algorithm, Wikipedia writes:

This algorithm is simple, but it requires that the smaller join relation fits into memory, which is sometimes not the case. A simple approach to handling this situation proceeds as follows:

So we really do have to create an algorithm which deals with running out of memory. (And this begs the question "which piss poor hardware is the weekly challenge running that such a small example runs out of memory?").

Now, there is a way out of this. If you building perl with the -DPERL_EMERGENCY_SBRK compilation option, and if you are building perl so it uses its own malloc (which it doesn't do by default on most platforms), then you can allocate some emergy memory using $^M.

This gives us a trappeble "running out of memory" event, but it's very unlikely the program will be left in a state we can continue. So, whatever algorithm we come up with, it will be extremely flimsy, and unlikely to work. Given that, what is presented below is untested; we could not be bothered to recompile perl.

Solution

Perl

First, we check whether the program is run by a perl which can actually trap an out of memory event. If not, that's it. Else, we allocate some emergency memory. (We're using 1 Mb of emergy memory. No idea whether that is realistic or not).

use Config;
use List::Util 'max';

BEGIN {
    die "No hash join for you -- recompile first!\n" unless
        $Config::Config {malloc_cflags} =~ /-DPERL_EMERGENCY_SBRK\b/ &&
        $Config::Config {usemymalloc} eq 'y';

    $^M = " " x (1 << 20);  # 1 Mb.
}

my @R = ...;  # @player_ages in the example
my @S = ...;  # @player_names in the example

my $idx_R_njk = 0;
my $idx_R_jk  = 1;
my $idx_S_jk  = 0;
my $idx_S_njk = 1;

We now create a __DIE__ handler, to trap an out of memory event. In that case, we call a subroutine called flush.

my %output;
$SIG {__DIE__} = sub {
    if ("@_" =~ /Out of memory/) {
        flush (\@S, $idx_S_jk, \%output)
    }
    else {
        die @_;  # Propagate
    }
};

The main loop of the program populates the output structure %output. This implements step 1.1 of the algorithm. The hash %output uses the ages as keys; each value is an array of first names.

foreach my $r (@R) {
    push @{$output {$$r [$idx_R_jk]}} => $$r [$idx_R_njk];
}

If we are running out of memory, flush is called. It processes the %output structure, and sends results to standard output. To get the memory needed, we undefine $^M, do our thing, release the memory used by %output, and set $^M again. This implements step 1.2 of the algorithm above.

sub flush ($S, $idx_S_jk, $output) {
    undef $^M;  # Release memory.

    # 
    # Scan $S. For each match in $output, output a line.
    #
    foreach my $entry (@$S) {
        if ($$output {$$entry [$idx_S_jk]}) {
            for (my $i = 0; $i < @{$$output {$$entry [$idx_S_jk]}}; $i ++) {
                printf qq [%2d, %-${max_width}s "%s"\n],
                                 $$output {$$entry [$idx_S_jk]} [$i],
                           '"' . $$entry [$idx_S_jk] . '",',
                                 $$entry [$idx_S_njk];
            }
        }
    }

    %$output = ();  # Reset output table

    $^M = " " x (1 << 20); # Claim emergy memory again
} 

As a final step, after the main loop, we have to call flush again. This is step 2 of the algorithm above:

flush (\@S, $idx_S_jk, \%output);

Find the full program on GitHub.


Please leave any comments as a GitHub issue.