Calebotomy

Teaching Perl - Week 3 - Textual Data

So last week you should have covered variables, conditionals, flow control statements, calling functions, stdin/out/err, and using new Perl features.

This week we’re covering text processing, scalars, external libraries, and reinforcing what we learned last week. Text processing is one of Perl’s specialties, mostly due to its powerful regular expression engine.

First install LWP from CPAN. Running cpanp -i Bundle::LWP will do it.

#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;

getprint("http://aur.archlinux.org/packages/modern-perl/modern-perl/PKGBUILD");

So this is almost completely new… What does it do, and what does it have to do with text?

LWP is short for libwww-perl. LWP::Simple is a simple procedural interface to it. getprint is a function which is roughly equivalent to "fetch the URI and print the content that you got back". So this isn’t very useful… (OK, it’s actually quite useful if you’re on Arch and you want a PKGBUILD.) What we want to do in this case is cut the junk and print the pkgname. As a side note, this code does the exact same thing as the Python example in Head First Programming: A Learner's Guide to Programming Using the Python Language, but is shorter if you cut the boilerplate (#! … use strict; use warnings), and the same length if you don’t.

Now let’s have a first attempt at getting the package name, shall we? We need to read the contents into a variable first. Note: we’ve changed from getprint to get, because get returns the contents so we can store them in a variable. (Note: still shorter than the Python.)

At this point the book goes into how a string is an array of characters. Perl is contextual, so at this point our scalar is essentially a string, but just because it’s a scalar does not mean it’s currently a string. In our previous examples we were handling scalars in a numeric context.

So here we’re looking at the 11-character substring past character 281. The output should be modern-perl. This code is horrible: what would happen if I were to change this PKGBUILD? (If this ever doesn’t output modern-perl, that’s probably exactly what happened.) Remember, say appends a newline character to its output; print doesn’t. We used print previously because the output already had a newline at the end.
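
Here is a minimal sketch of the fixed-offset approach described above, for anyone who wants to try it without hitting the network. The PKGBUILD text is a short made-up stand-in (in the real script it would come from get), so the offset here is 37 rather than 281. The variable names are mine too.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';

# Hypothetical stand-in for the fetched PKGBUILD. In the real script this
# would be: my $content = get($uri);  (get comes from LWP::Simple)
my $content = <<'END';
# Contributor: Example User
pkgname='modern-perl'
pkgver=1.03
END

# 37 characters come before the "m" of modern-perl in this sample,
# and the name itself is 11 characters long.
my $pkgname = substr( $content, 37, 11 );

# say adds the trailing newline that the substring lacks.
say $pkgname;
```

This prints modern-perl, but only for as long as nothing before the name changes length.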

There are thousands of PKGBUILDs on AUR, so let’s see if we can get the name from another one by changing the URI. The output at the time of this writing is
dirs')
depe
and will probably change. A newline is just another character in the scalar, so nothing prevents one from being printed within the substring. Our URI got a little long here, so we split that piece of code across 2 lines for readability. Your code lines shouldn’t go over 78 characters if you can help it. A URI would be an acceptable exception though.
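
The failure described above is easy to reproduce offline. This is only a sketch: both PKGBUILD texts are made-up stand-ins for what get would return, laid out so that the same hard-coded offset (37 here, an assumption of the sample) works for one and lands in the middle of two lines for the other.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';

# Hypothetical PKGBUILD texts; the real ones would come from get($uri).
my $modern = <<'END';
# Contributor: Example User
pkgname='modern-perl'
pkgver=1.03
END

my $moose = <<'END';
pkgname='perl-moose'
options=('!emptydirs')
depends=('perl')
END

# The offset that happens to work for the first sample.
my $offset = 37;

my $good = substr( $modern, $offset, 11 );    # lands on the name
my $bad  = substr( $moose,  $offset, 11 );    # straddles a newline

say $good;
say $bad;
```

The second say prints its 11 characters across two lines, because the newline in the middle of the substring is printed like any other character.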

So we use the index function to search the string for some text and return its position; we can then use that position to get our substring. The position returned is zero-based, so it is 1 less than the character count at which the text we searched for starts. We then add 9 to get to the string we actually want. Our output is now
perl-moose'
Why the '? Well, "perl-moose" is one character shorter than "modern-perl", so our 11-character substring picks up the closing quote as well.
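
A sketch of the index approach, again on a made-up PKGBUILD instead of a live fetch. I’m assuming the search text is pkgname=' (nine characters, which is where the 9 comes from in this version); the original script may have searched for something slightly different.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';

# Hypothetical PKGBUILD text; the real one would come from get($uri).
my $content = <<'END';
pkgname='perl-moose'
options=('!emptydirs')
depends=('perl')
END

# index returns the zero-based position of the match, or -1 if not found.
my $marker = "pkgname='";
my $pos    = index( $content, $marker );
die "no pkgname found\n" if $pos < 0;

# Skip the 9 characters of the marker, then take 11 characters as before.
my $pkgname = substr( $content, $pos + length($marker), 11 );
say $pkgname;
```

The start position is now found for us, but the length is still hard-coded at 11, which is why the stray quote shows up.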

Let’s use a regular expression.
Regular expressions are really their own mini language: very terse, and with several dialects. Have your students go through the Regular Expression Tutorial and the Regular Expression page. In short, this regex looks for pkgname='something' in $content, takes something, and puts it in $pkgname. The ( ) inside the / / tell Perl which part of the match to assign to $pkgname. [\w-]+ says to match 1 or more alphanumeric, _ or - characters. Regular expressions are a huge topic in and of themselves, so don’t spend a huge amount of time on them. They are important to know, but you don’t have to master them right now.
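
Something like the following matches the description above. Treat it as a sketch: the sub name pkgname_of and the sample strings are mine, and the real $content would come from get.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';

# Return the package name from PKGBUILD text, or undef if none is found.
sub pkgname_of {
    my ($content) = @_;

    # In list context the match returns what the ( ) captured.
    my ($pkgname) = $content =~ /pkgname='([\w-]+)'/;
    return $pkgname;
}

# Hypothetical PKGBUILD snippets of different layouts and name lengths.
say pkgname_of("pkgname='perl-moose'\npkgver=2.0\n");
say pkgname_of("# Contributor: Example User\npkgname='modern-perl'\n");
```

Neither the position of the name nor its length is hard-coded any more, so both samples give a clean name with no trailing quote.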

I’d suggest having a copy of Regular Expression Pocket Reference: Regular Expressions for Perl, Ruby, PHP, Python, C, Java and .NET, possibly Mastering Regular Expressions, and, if you want some common recipes, Regular Expressions Cookbook. You should also warn your students that although you can use regular expressions to match any kind of text, for well-defined formats such as XML, HTML, CSV, INI and more there are better ways of extracting data. For things like email addresses and phone numbers they should not write their own regexes but use modules to validate them: they will undoubtedly make mistakes doing it themselves, and a module has been well tested.

The example in Head First Programming: A Learner's Guide to Programming Using the Python Language goes on to explain how to check the page periodically for updates, and only output when a value is less than a certain amount. Since I have no idea when I’ll be updating these PKGBUILDs, I don’t want to get into that code. It would make a good homework assignment. You can use sleep to make sure you only fetch the PKGBUILD every so often. I can guarantee that, barring a buggy release, I won’t update any PKGBUILD more than once every 86400 seconds (1 day). You’ll have to check pkgver and pkgrel to see if either of them is newer than in the previously fetched PKGBUILD.
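
If your students need a nudge on the homework, here is one possible starting point, not a full solution. It is simplified on purpose: it treats any change in pkgver or pkgrel as "newer" rather than really comparing versions, the sub name release_of is made up, and the two PKGBUILD strings are stand-ins for successive fetches.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';

# Return "pkgver-pkgrel" from PKGBUILD text, or undef if either is missing.
sub release_of {
    my ($content) = @_;

    # /m lets ^ match at the start of every line, not just the first.
    my ($ver) = $content =~ /^pkgver=['"]?([\w.]+)/m;
    my ($rel) = $content =~ /^pkgrel=['"]?([\w.]+)/m;
    return unless defined($ver) && defined($rel);
    return "$ver-$rel";
}

# Hypothetical stand-ins for two fetches made a day apart.
my $old = release_of("pkgname='modern-perl'\npkgver=1.03\npkgrel=1\n");
my $new = release_of("pkgname='modern-perl'\npkgver=1.03\npkgrel=2\n");

say "PKGBUILD changed: $old -> $new" if $old ne $new;

# The polling loop itself would look roughly like this:
#   while (1) {
#       my $content = get($uri);    # get is from LWP::Simple
#       ...compare release_of($content) with the previous value...
#       sleep 86_400;               # once a day is plenty
#   }
```

Working out whether the new version is actually greater, and not just different, is left as the real exercise.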

