1. reduce() and combine()
    1. Boolean operators

Databases

Back in Section ??? we saw the beginnings of a database system that would manufacture an iterator containing the results of a simple query. To open the database we did

        my $dbh = FlatDB->new($datafile);

and then to perform a query,

        $dbh->query($filename, $value);

or

        $dbh->callbackquery(sub { ... });

which selects the records for which the subroutine returns true.

Let's extend this system to handle compound queries. Eventually, we'll want the system to support calls like this:

        $dbh->select("STATE = 'NY' | 
                      OWES > 100 & STATE = 'MA'");

This will require parsing of the query string, which we'll see in detail in Chapter ???. In the meantime, we'll build the internals that are required to support such queries.

The internals for simple queries like "STATE = 'NY'" are already done, since that's exactly what the $dbh->query('STATE', 'NY') does. We can assume that other simple queries are covered by similar simple functions, or perhaps by calls to callbackquery(). What we need now are ways to combine simple queries into compound queries.

The i_and() and i_or() functions we saw earlier will do what we want, if we modify them suitably. The main thing we need to arrange is to define a canonical order for records produced by one of the simple query iterators. In particular, we need some way for the i_and() and i_or() operators to recognize that their two argument iterators have generated the same output record.

The natural way to do this is to tag each record with a unique ID number as it comes out of the query. Two different records will have different ID numbers. For flat-file databases, there's a natural record ID number already to hand: the record number of the record in the file. We'll need to adjust the query() function so that the iterators it returns will generate record numbers. When we last saw the query() function, it returned each record as a single string; this is a good opportunity to have it return a more structured piece of data:

Download code for FlatDB_Composable.pm

        package FlatDB_Composable;
        use base 'FlatDB';
        use base 'Exporter';
        @EXPORT_OK = qw(query_or query_and query_not query_without);
        use Iterator_Logic;

        # usage: $dbh->query(fieldname, value)
        # returns all records for which (fieldname) matches (value)
        sub query {
          my $self = shift;
          my ($field, $value) = @_;
          my $fieldnum = $self->{FIELDNUM}{uc $field};
          return unless defined $fieldnum;
          my $fh = $self->{FH};
          seek $fh, 0, 0;
          <$fh>;                # discard header line
          my $position = tell $fh;
          my $recno = 0;

          return sub {
            local $_;
            seek $fh, $position, 0;
            while (<$fh>) {
              chomp;
              $recno++;
              $position = tell $fh;         
              my @fields = split $self->{FIELDSEP};
              my $fieldval = $fields[$fieldnum];
              return [$recno, @fields] if $fieldval eq $value;
            }
            return;
          };
        }
=test query-composable
        use FlatDB_Composable;
        do 'query-composable';
        my $dbh = FlatDB_Composable->new('Programs/db.txt') or die $!;

        my $q = $dbh->query('STATE', 'NY');
        # assume order of records
        is_deeply($q->(),[1,'Adler','David','New York','NY','157.00']);
        is_deeply($q->(),[5,'Schwern','Michael','New York','NY','149658.23']);
        is($q->(),undef);
=endtest query-composable

It might be tempting to try to use Perl's built-in $. variable here instead of carrying our own synthetic $recno, but that's a bad idea. We took some pains to make sure that a single database filehandle could be shared among more than one query. However, the information for $. is stored inside the filehandle; since we don't want the current record number to be shared between queries, we need to store it in the query object (which is private) rather than in the filehandle (which isn't). An alternative to maintaining a special $recno variable would be to use $position as a record identifier, since it's already lying around, and since it has the necesary properties of being different for different records and of increasing as the query proceeds through the file.

Now we need to manufacture versions of i_and() and i_or() that use the record ID numbers when deciding what to pass along. Because these functions are curried, we don't need to rewrite any code to do this:

=contlisting FlatDB_Composable.pm
        BEGIN { *query_or  =  i_or(sub { $_[0][0] <=> $_[1][0] });
                *query_and = i_and(sub { $_[0][0] <=> $_[1][0] });
              }
=contlisting FlatDB_Composable.pm invisible
        BEGIN { *query_without = i_without(sub { $_[0][0] <=> $_[1][0] }); }

        sub callbackquery {
          my $self = shift;
          my $is_interesting = shift;
          my $fh = $self->{FH};
          seek $fh, 0, SEEK_SET;
          <$fh>;                # discard header line
          my $position = tell $fh;
          my $recno = 0;

          return sub {
            local $_;
            seek $fh, $position, SEEK_SET;
            while (<$fh>) {
              $position = tell $fh;         
              chomp;
              $recno++;
              my %F;
              my @fieldnames = @{$self->{FIELDS}};
              my @fields = split $self->{FIELDSEP};
              for (0 .. $#fieldnames) {
                $F{$fieldnames[$_]} = $fields[$_];
              }
              return [$recno, @fields] if $is_interesting->(%F);
            }
            return;
          };
        }

        1;

The comparator function says that arguments $_[0] and $_[1] will be arrays of record data, and that we should compare the first element of each, which is the record number, to decide which data should come out first and to decide record identity.

In Chapter ???, we'll build a parser which, given this query:

        "STATE = 'NY' | OWES > 100 & STATE = 'MA'"

makes this call:

        query_or($dbh->query('STATE', 'NY'),
                 query_and($dbh->callbackquery(sub { $F{OWES} > 100 }),
                           $dbh->query('STATE', 'MA')
                          ))

and returns the resulting iterator. In the meantime, we can manufacture the iterator manually.

=test query-or
        use Iterator_Logic;
        BEGIN {
        *query_or  =  i_or(sub { $_[0][0] <=> $_[1][0] });
        *query_and = i_and(sub { $_[0][0] <=> $_[1][0] });
        }

        use FlatDB_Composable;
        do 'query-composable';
        my $dbh = FlatDB_Composable->new('Programs/db.txt') or die $!;
        my $q = query_or($dbh->query('STATE', 'NY'),
                        query_and($dbh->callbackquery(sub { $F{OWES} > 100 }),
                                  $dbh->query('STATE', 'MA')
                                 ));
        is_deeply($q->(),[1,'Adler','David','New York','NY','157.00']);
        is_deeply($q->(),[5,'Schwern','Michael','New York','NY','149658.23']);
        is($q->(),undef);
=endtest query-or

The one important logical connective that's still missing is 'not'. 'not' is a little bit peculiar, logically, because its meaning is tied to the original database. If $q is a query for all the people in a database who are male, then query_not($q) should produce all the people from the database who are female. But the query_not function can't do that without visiting the original database to find the female persons. Unlike the outputs of query_and() and query_or(), the output of query_not() is not a selection of the inputs.

One way around this is for each query to capture a reference back to the original database that it's a query on. An alternative is to specify the database explicitly, as $dbh->query_not($q). Then we can implement a more general operator on queries, the so-called set difference operator, also known as without.

=contlisting Iterator_Logic.pm
        # $a but not $b
        sub i_without_ {
          my ($cmp, $a, $b) = @_;
          my ($av, $bv) = ($a->(), $b->());
          return sub {
            while (defined $av) {
              my $d;
              while (defined $bv && ($d = $cmp->($av, $bv)) > 0) {
                $bv = $b->();
              }
              if ( ! defined $bv || $d < 0 ) {
                my $rv = $av; $av = $a->(); return $rv;
              } else {
                $bv = $b->();
                $av = $a->();
              }
            }
            return;
          }
        }

        BEGIN {
          *i_without = curry \&i_without_;
          *query_without = 
            i_without(sub { my ($a,$b) = @_; $a->[0] <=> $b->[0] });
        }

        1;

If $a and $b are iterators on the same database, query_without($a, $b) is an iterator which produces every record that appears in $a but not in $b. This is useful on its own, and it also gives us a base for 'not', which becomes something like this:

=contlisting FlatDB_Composable.pm
        sub query_not {
          my $self = shift;
          my $q = shift;
          query_without($self->all, $q);
        }

$self->all is a database method which performs a trivial query that disgorges all the records in the database. We could implement it specially, or, less efficiently, we could simply use

        sub all {
          $_[0]->callbackquery(sub { 1 });
        }

        1;

A possibly amusing note is that once we have query_without(), we no longer need query_and(), since (a and b) is the same as (a without (a without b)).

=test query-not 5
        use FlatDB_Composable;
        # MJD ???   
        # *FlatDB::query_without = i_without(sub { my ($a,$b) = @_; $a->[0] <=> $b->[0] });
        # do 'query-not';
  
        my $dbh = FlatDB_Composable->new('Programs/db.txt') or die $!;
        my $q = $dbh->query_not($dbh->query('STATE', 'NY'));
  
        # problem #1: the wrong query function is beign called.  we're
        # getting a string instead of an arrayref.  it has something to do
        # with the query_not.
  
        #while (my $t = $q->()) {
        #  ok(! ref $t, "$t is a plain string");
        #  unlike( $t, qr/:NY:/, "$t not in NY" )
        #}
  
        # arrayref version
        while (my $t = $q->()) { 
          print "# @$t\n";
          isnt( $t->[4], "NY", "$t->[4] is not NY");
          $count++;
        }
        is($count, 4, "four non-NY records");
=endtest query-not

Operator Overloading

Here ??? Perl provides a feature called operator overloading that lets us write complicated query expressions more conveniently. Operator overloading allows us to redefine of Perl's built-in operator symbols to have whatever meaning we like when they are applied to our objects. Enabling the feature is simple. First we make a small change to methods like query() so that they return iterators that are blessed into package FlatDB. Then we add

Download code for FlatDB_Overloaded.pm

        package FlatDB_Overloaded;
        BEGIN {
          for my $f (qw(and or without)) {
            *{"query_$f"} = \&{"FlatDB_Composable::query_$f"};
          }
        }
        use base 'FlatDB_Composable';

        sub query {
          $self = shift;
          my $q = $self->SUPER::query(@_);
          bless $q => __PACKAGE__;
        }

        sub callbackquery {
          $self = shift;
          my $q = $self->SUPER::callbackquery(@_);
          bless $q => __PACKAGE__;
        }

        1;
=contlisting FlatDB_Overloaded.pm
        use overload '|' => \&query_or,
                     '&' => \&query_and,
                     '-' => \&query_without,
                     'fallback' => 1;

at the top of FlatDB.pm(). From then on, any time a FlatDB object participates in an | or & operation, the specified function will be invoked instead.

Now, given the following simple queries:

        my ($ny, $debtor, $ma) = 
                ($dbh->query('STATE', 'NY'),
                 $dbh->callbackquery(sub { $F{OWES} > 100 }),
                 $dbh->query('STATE', 'MA')
                );

we'll be able to replace this:

        my $interesting = query_or($ny, query_and($debtor, $ma))

with this:

        my $interesting = $ny | $debtor & $ma;

The operators are still Perl's built-in operators, and so obey the usual precedence and associativity rules. In particular, & has higher precedence than |.

=test query-overload
        use FlatDB_Overloaded;
  
        my $dbh = FlatDB_Overloaded->new('Programs/db.txt') or die $!;
        my ($ny, $debtor, $ma) = 
                       ($dbh->query('STATE', 'NY'),
                        $dbh->callbackquery(sub { $F{OWES} > 100 }),
                        $dbh->query('STATE', 'MA')
                       );

        my $q = $ny | $debtor & $ma;
        print "# --$q--\n"; # <-- perl bug?  note the non hex values
  
        is_deeply($q->(),[1,'Adler','David','New York','NY','157.00']);
        is_deeply($q->(),[5,'Schwern','Michael','New York','NY','149658.23']);
        is($q->(),undef);

        ($ny, $debtor, $ma) = 
                       ($dbh->query('STATE', 'NY'),
                        $dbh->callbackquery(sub { my %F=@_; $F{OWES} > 200 }),
                        $dbh->query('STATE', 'MA')
                       );

        my $q = $ny & $debtor | $ma;
        print "# --$q--\n"; # <-- perl bug?  note the non hex values
  
        is($q->()->[0],2);
        is($q->()->[0],4);
        is($q->()->[0],5);
        is($q->(),undef);
=endtest query-overload
TOP