        elsif ($count == 0) {
            delete $$tref{$chunk};
        }
        else {
            die "Decrementing a chunk (\"$chunk\") not in the table";
        }
    }
    return $tref;
}
We would be justified in working fairly hard to make this computation efficient; fortunately, Perl has done a very good job of implementing hash tables, so the basic lookup implied by {$chunk} will do fine for most conceivable applications.
Notice that there are no explicit hash objects in the two routines (hash object names would start with "%"). That's a bit startling and not ideal for reading the code, but natural to the use of references in Perl. Whenever we
access or set an element of the hash table, we refer to the scalar element (a leading "$") of the reference to the table, which is itself a scalar, $tref; hence, the prevalent idiom of $$tref{$chunk}.
These routines do some error checking, and use the standard Perl die statement. If that sounds a bit drastic in code intended to be used from R, not to worry. The RSPerl interface does a nice job of wrapping the resulting error message and exiting the calling R expression cleanly, with no permanent damage.
8.6  Examples of Text Computations
In this section we examine or re-examine some examples, looking both at R and Perl. Choosing and designing computations for text data involves many tradeoffs. Nearly any example can be treated in multiple ways, more than one of which might be suitable depending on the experience of the programmer and/or the size or detailed characteristics of the application. The examples illustrate some of the choices and the tradeoffs they involve.
Data with repeated values
A common departure from strictly “rectangular” or data-frame-like structure comes when some variables are observed repeatedly, so that the observation is not a single number or quantity, but several repetitions of the same quantity. If the number of repetitions varies from one observation to the next, the data has a list-like structure: in R terminology, each observation is an element in the list consisting of a vector of the values recorded for that observation. Either R or Perl can deal with such data in a simple way. The differences are useful to consider.
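As a small illustration in R (the data and the names here are made up for the example), repeated values of this kind are naturally held as a list with one numeric vector per observation:

    ## Hypothetical data: repeated weight measurements, one element per subject
    weights <- list(
        subj1 = c(71.2, 70.8, 70.5),
        subj2 = c(82.4, 82.9),               # the number of repetitions varies
        subj3 = c(65.0, 64.7, 64.9, 65.1)
    )
    sapply(weights, length)   # repetitions per observation: 3 2 4
    sapply(weights, mean)     # summaries are computed element by element

Computations then iterate over the elements of the list, rather than over the rows of a data frame.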
To import such data, there must be a way to distinguish the repeated values from other variables. The simplest case diverts the repeated values to a separate file, written one line per set of repeated observations. In Section 8.2, page 296, we showed a computation for this case, based on reading the lines of repeated values as separate strings and then splitting them by calling strsplit(). Here's an alternative, allowing a more flexible form of data. In this version, successive lines may have different formats, provided each of the lines is interpretable by the scan() function. The lines might come in pairs, with the first line of each pair having non-repeated variables and the second the repeated values.
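A rough sketch of such a computation follows; it is not the version developed in the text, and the function name readPaired() and the choice of field types are assumptions made only for illustration. The idea is to read the file as lines and let scan() interpret each line on its own:

    ## A sketch only: assumes the lines come strictly in pairs, with the
    ## non-repeated variables read as character fields and the repeated
    ## values read as numbers.
    readPaired <- function(file) {
        lines <- readLines(file)
        odd <- seq(1, length(lines), by = 2)    # lines of non-repeated variables
        even <- odd + 1                         # lines of repeated values
        fixed <- lapply(lines[odd],
                        function(line) scan(text = line, what = "", quiet = TRUE))
        repeated <- lapply(lines[even],
                           function(line) scan(text = line, quiet = TRUE))
        list(fixed = fixed, repeated = repeated)
    }

Because each line is handed to scan() separately, the two lines of a pair are free to have entirely different formats.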

For example, the first line might be data for a state in the United States, with the abbreviation, population, area, and center (as in the state data of the R datasets package). The following line might list data for the largest

