© 2009. All rights reserved.

you should use grep

You should use GREP.
If you haven’t heard of grep before, it’s a very fancy pattern search tool that many text editors build in. Think of the ‘find’ function but with more advanced capabilities, such as searching for whitespace characters, the end of a line or file, or particular character patterns.
I’ve become accustomed to some of the more trivial but useful tricks available using grep, some of which might be handy to you, dear reader.
Grep comes with a series of useful wildcards: character combinations which refer to specific text items. Examples include $ (end of line), [aeiou] (any lowercase vowel), \r (carriage return) and \s (any whitespace character), which can be combined in clever ways to make text searching much easier.
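
As a rough illustration of how these wildcards behave and combine, here is a small sketch using Python’s `re` module rather than TextWrangler itself. The sample text is made up, and the two regex dialects differ in some details, so treat it as an approximation only.

```python
import re

# Made-up sample text; "\r" is a carriage return used here as a line break.
text = "apple pie\rgreen tea\rsky"

# [aeiou] matches any single lowercase vowel.
vowels = re.findall(r"[aeiou]", "grep")

# \s matches any whitespace character, which includes spaces and \r.
words = re.split(r"\s+", text)

# $ anchors the match to the end of the text being searched.
ends_with_sky = bool(re.search(r"sky$", text))

# Combined: a vowel immediately followed by a whitespace character.
vowel_then_space = re.findall(r"[aeiou]\s", text)

print(vowels, words, ends_with_sky, vowel_then_space)
```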

To put all of this into context, I’ll give an example. I have recently been doing a bit of manual HTML scraping to isolate data contained in HTML tags from a number of pages. For this example, what I want to do is parse a series of HTML documents and extract the title text contained in each one.


Say your example HTML reads like this:






To do this you could manually open each and every file, select the text to delete, delete it and then save and close the file – a very time-consuming process. There must be a better way, surely! Thankfully there is – Grep. My grep tool of choice is the very handy freeware application TextWrangler, which combines very powerful grep pattern searching with an absolutely essential multi-file search option.
So how can we do this quickly and painlessly?

1. Open a few of the documents and check that the syntax and layout are all similar. In this case let’s assume that all files follow the aforementioned layout.

2. Select ‘find’ (command+F), making sure to choose the ‘use grep’ and ‘start from top’ options.

3. Type in the following search pattern:
[Picture 2: image of the search pattern]

(select from the start of the file, [any whitespace or non-whitespace character], zero or more characters, until the opening title tag is found).

4. This will select all text from the start of the file to the end of your opening title tag. Test this as a multi-file search on all open documents to make sure your search was accurate.

5. Do a find + replace with the same search pattern, making sure to replace with nothing (blank replace field).

6. Next is a similar search pattern, but now we’re looking for everything from the closing title tag to the end of the file.
The search pattern is:
[Picture 3: image of the search pattern]

This will select all of the remaining text that is not the actual title text.

7. Do the same find and replace for all open HTML documents. This time, however, it is useful to replace all text following the title with a single comma ‘,’. This will come in handy for automatically building lists.
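
The same two replacements can also be sketched outside TextWrangler, here with Python’s `re` module. The patterns below are a guess based on the descriptions in steps 3 and 6 (the originals were posted as images), so they are not necessarily the ones pictured. The `<title>` tag, the sample page and the function name `extract_title` are all assumptions made for illustration.

```python
import re

# Hypothetical sample page; the layout of the real files is an assumption.
html = (
    "<html>\n<head>\n<title>Example page title</title>\n</head>\n"
    "<body>\n<p>Other content</p>\n</body>\n</html>\n"
)

# Steps 3 to 5: from the start of the text, any whitespace or non-whitespace
# characters, up to and including the opening tag. The lazy *? stops at the
# first <title> it finds.
head = re.compile(r"\A[\s\S]*?<title>")

# Steps 6 and 7: from the closing tag through to the end of the text.
tail = re.compile(r"</title>[\s\S]*\Z")

def extract_title(page: str) -> str:
    """Strip everything around the title and leave a trailing comma."""
    page = head.sub("", page, count=1)   # replace with nothing
    return tail.sub(",", page, count=1)  # replace with a single comma

print(extract_title(html))
```

If a file contains no title tag at all, neither pattern matches and the text comes back unchanged, which is worth checking for before trusting a batch run.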

So now we have a lot of text documents with only the exact text we’re looking for in them, but we’re still faced with the problem of scale. We may be able to open each file and copy/paste the contents into another, but what if we needed to do this 1,000 times over? Or 10,000 times? Surely there’s a better way to do this?
Happily there is. Another built-in feature of TextWrangler is the Edit – Insert – File Contents menu item. Once your HTML text files are stripped of all extraneous content and ‘comma-delimited’, the next step is to combine them.

8. Open one document, select Edit – Insert – File Contents, then select all of the remaining documents you wish to combine into one.

9. Once the text files are combined, save the result as a new document.
Done!
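
For thousands of files, the combining step could also be scripted. This is a minimal sketch in Python, assuming the stripped files all sit in one folder and end in `.html`. The function name `combine_files`, the folder layout and the output file name are hypothetical.

```python
from pathlib import Path

def combine_files(folder: str, output: str, pattern: str = "*.html") -> int:
    """Concatenate every matching file in `folder` into `output`.

    Returns the number of files that were combined.
    """
    out_path = Path(output)
    # Collect the inputs first, in name order, skipping the output file itself
    # in case it lives in the same folder and matches the pattern.
    paths = sorted(
        p for p in Path(folder).glob(pattern)
        if p.resolve() != out_path.resolve()
    )
    with out_path.open("w", encoding="utf-8") as out:
        for path in paths:
            out.write(path.read_text(encoding="utf-8"))
    return len(paths)
```

Something like `combine_files("pages", "titles.txt")` would then write the comma-delimited titles into a single file, in file name order.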

Naturally this also applies to other text components in other types of html tags. If, for example, you had another set of tags which were nested – say if you wanted to select all items in a

but not the

items, you could follow the same process. I recently used this technique to strip out all text that wasn’t usable from the cityrail.info timetable pages – something which could have taken years had it not been for the abilities of TextWrangler + Grep. Highly recommended. Click here to read more on TextWrangler’s features.

J

(Update: I ended up pasting in images of the search patterns because the “\” character doesn’t seem to show up on the page – so copy+pasting the text to try yourself would fail, somewhat missing the point of this post. You will need to copy them out manually, which is also missing the point of this post, but it’s not a whole lot of text to copy so it isn’t too much of a problem. Shame I couldn’t resolve the html parsing, not sure why).
