This is Greptastic, an entry originally posted on April 18, 2003 in the blog nebulose.net. In chronological order, before this was Serial. After this comes Julia. If you're lost, I recommend the about page.

Greptastic

This builds on Dean Allen's tutorial, "Finding needles in a text haystack with Regular Expressions". I call it "Finding needles in a text haystack, melting them down into little motes and ingots, and reshaping them into something useful with Grep".

Let's imagine you've somehow gotten yourself the project of converting an existing site to Movable Type. The site contains some 208 entries, in about a dozen categories. What you have is the HTML files, authored in--horror of horrors--that heinous blight on sensible web design everywhere, Microsoft FrontPage. What you need is perfectly formed, whitespace-sensitive metadata, according to the Movable Type import format.

Is this an elaborate hypothetical situation or what?

Grep works by recognizing patterns. Suppose you notice that in the garble of nested tables and meaningless divisions, the real content always comes right after the code <td class="bord" width="74%" height="76%" valign="top">. You can use Grep to delete everything from the beginning of the document (<html>) to that point, inclusive. Hit F3 or Control+F in your Grep-enlightened text editor (EditPad, Microsoft Word, BBEdit), turn on Regular Expressions, and search for:

<html>.*?<td class="bord" width="74%" height="76%" valign="top">

Explanation:

  1. <html> is what you want the selection to begin with. The first thing in the pattern.
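  2. A period in Grep syntax means "any character". Anything at all, as long as it's after <html>. (In many editors the period stops at a line break unless you also switch on an option along the lines of "dot matches newline"; since this selection spans many lines, you'll want that option on.)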
  3. The * means "repeat the previous item zero or more times". Since the previous item is a period, it means any number of any characters. So .* is essentially a variable-length wildcard.
  4. The ? disables the "greed" of the * operator. Basically, it tells the * to stop at the shortest possible selection instead of the longest.
  5. The last part is what we want to be the end of the selection. Grep will select everything (.*) up until the first instance of this ending code. If the ? operator were not present, it would select everything up until the last instance.
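
If you'd rather try this outside a text editor, here is a minimal sketch in Python. The sample page is a made-up stand-in for a FrontPage file (only the marker tag comes from above), and it assumes the re.DOTALL flag is what you want so that the period can cross line breaks. It runs the pattern both with and without the ? so you can see the difference:

```python
import re

# The marker tag from the post; it contains no regex metacharacters,
# so it can be dropped into the pattern unescaped.
MARKER = '<td class="bord" width="74%" height="76%" valign="top">'

# Hypothetical page: the marker appears twice so greedy and lazy differ.
page = (
    "<html>\n<body>\n<table><tr>\n"
    + MARKER + "\nFirst entry text\n</td>\n"
    + MARKER + "\nSecond cell\n</td>\n"
    "</tr></table>\n</body>\n</html>\n"
)

# re.DOTALL lets "." match line breaks too, so the match can span lines.
lazy = re.compile("<html>.*?" + MARKER, re.DOTALL)   # stops at first marker
greedy = re.compile("<html>.*" + MARKER, re.DOTALL)  # runs to last marker

lazy_result = lazy.sub("", page, count=1)
greedy_result = greedy.sub("", page, count=1)

print(repr(lazy_result))
print(repr(greedy_result))
```

The lazy version keeps everything after the first marker, which is what we want here. The greedy version swallows the first cell along with the header.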

There might also be a lot of garbage at the end of the file, so you could remove that using a similar expression:

</td>.*?</html>
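
A quick way to check that expression is another small Python sketch. The text below is an invented leftover of a page, and re.DOTALL is again my assumption for letting the period cross line breaks. One caveat worth knowing: the match begins at the first </td> it finds, so this only behaves if the entry itself contains no table cells.

```python
import re

# Hypothetical remainder of a page once the header has been stripped.
remainder = (
    "First entry text\n"
    "More entry text\n"
    "</td>\n</tr></table>\n</body>\n</html>\n"
)

# Same pattern as in the post; re.DOTALL lets "." cross line breaks.
footer = re.compile(r"</td>.*?</html>", re.DOTALL)

# Delete from the first </td> through the first </html> that follows it.
body = footer.sub("", remainder, count=1)

print(repr(body))
```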

Now we're getting somewhere! FrontPage still frustrates our efforts, though--it has a habit of putting meaningless spaces in front of every line. The Movable Type importing mechanism won't tolerate extra spaces, so we need to delete them. A sample substitution would be to search and replace:

^ *(.*?)
[line break]

With:

\1
[line break]

Huh?

  1. A caret is the Grep symbol for the concept "Beginning of line". We're looking at a selection that starts with the beginning of a line.
  2. Next is a space followed by an asterisk. Remember that an asterisk means "repeated any number of times"--so, the beginning of a line, and then any number of spaces.
  3. Parentheses create a backreference; something we want to be kept in memory and recalled later. Inside the parentheses is another variable-length wildcard, .*?.
  4. The selection ends with a line break; only selecting one line at a time, going from the beginning (^) to the end.
  5. In the replacement string, we use a backslash and the number 1 to indicate that we want the contents of the first backreference (though this time there was only one). So the backreferenced part of the line--that is, everything except the leading spaces--is put right back. FrontPage, we laugh at thee.
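
The same substitution can be sketched in Python, with two assumptions of mine: the re.MULTILINE flag makes the caret match at the start of every line, and a $ stands in for the literal line break you'd type in an editor's search box. The sample lines are invented.

```python
import re

# Hypothetical lines, indented the way FrontPage likes to leave them.
indented = "   First line\n      Second line\nThird line\n"

# re.MULTILINE makes "^" and "$" match at the start and end of each line.
# Group 1 captures everything after the leading spaces.
stripped = re.sub(r"^ *(.*?)$", r"\1", indented, flags=re.MULTILINE)

print(repr(stripped))
```

If all you want is to drop the spaces, searching for `^ +` and replacing with nothing does the same job without a backreference. The longer form earns its keep once you want to rearrange what you captured.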

Now we've winnowed out all the cruft; what's left is just rearranging things. Dean's tutorial covers that quite adequately, so I will end here and wish you all happy hunting.
