Grep Pattern Searching
In Episode 8, I mentioned how I used Grep Pattern Searching in BBEdit to search for patterns in my text, rather than searching for specific pieces of text. The value of this is that, even though the actual text varies throughout my text file, if the patterns are consistent, I can do very complex and powerful search-and-replace operations that keep that variable text intact, while changing elements of the pattern around it.
Here’s an example from real life: A monthly magazine column of new products. Each write-up starts with a company name on one line, followed by a paragraph starting with the words “What’s New” followed by a colon, then a sentence or two of descriptive text about the product. The next paragraph starts with “The Value” followed by a colon, then a few sentences describing the positive attributes of the product for prospective buyers. After that, there’s a line for the company’s web address and another line for their phone number. This is a consistent pattern. But the text itself is not consistent. Every company name is different. Everything following “What’s New” is different, as is everything following “The Value” and all of the URLs and phone numbers.
So how do you search this whole document? The thing to understand is that you don’t search for specific text. Rather, you search for the pattern within which the text exists. What you’re searching for is any string of text (the company name), followed by a return, followed by the specific text “What’s New” and a colon, followed by any string of text (the product details), followed by a return, followed by the specific text “The Value” followed by a colon, followed by any string of text (the description of the product’s benefits), followed by a return, followed by any string of text (the web address) followed by a return, followed by any string of text (the phone number).
In BBEdit, that search instruction translates to this:
(.+)\rWHAT’S NEW:(.+)\rTHE VALUE:(.+)\r(.+)\r(.+)
The (.+) means one or more characters. The period matches any single character other than a return, and the plus sign extends that to mean one or more of them in a row. The parentheses around them make them a sub-pattern. In other words, the whole line above is the pattern, and the items in parentheses are sub-patterns within that pattern. In the above example, there are 5 sub-patterns, each representing variable text. The \r elements refer to returns in the original text.
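For readers without BBEdit, the same search can be tried in Python. This is only a sketch: the sample write-up is invented, and "\n" stands in for BBEdit's \r because Python's period stops at "\n" line breaks rather than at returns.

```python
import re

# One invented write-up; "\n" plays the role of BBEdit's \r here.
sample = (
    "Acme Widgets\n"
    "WHAT’S NEW: A smaller widget.\n"
    "THE VALUE: Fits in a pocket.\n"
    "www.example.com\n"
    "555-0100"
)

# Five sub-patterns, one per piece of variable text.
pattern = re.compile(r"(.+)\nWHAT’S NEW:(.+)\nTHE VALUE:(.+)\n(.+)\n(.+)")

match = pattern.search(sample)
company, whats_new, value, url, phone = match.groups()
print(company)    # Acme Widgets
print(whats_new)  # " A smaller widget." (the space after the colon is captured too)
print(phone)      # 555-0100
```

Each sub-pattern stays on its own line because the period never crosses a line break, which is what lets the fixed labels and returns act as the boundaries.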
Now…let’s say I wanted to do a replace operation based on this pattern. I have style sheets in InDesign for each element: Company Name, What’s New paragraph, Value paragraph, URL and Phone. On top of that, I have a character from a dingbat font that I put before the URL and another that I put before the phone number to serve as little icons in the layout. All of this is handled expertly by InDesign’s nested style sheets, but I need to put the text characters (in this case, a lower case “u” for the URL icon and an ampersand for the phone icon) first for the nested style sheet in InDesign to work.
To replace this pattern so that all of my style sheet references are included in the right places and have my icon characters added, I use the following replace instruction:
The bracketed “pstyle” elements are tags that InDesign will use to format this text automatically when placed in a document with the corresponding styles, the names of which follow the colon in the bracketed tag. The combinations of backslashes and numbers — \1 \2 \3 and so on — refer to the sub-patterns in the original search pattern. They’re numbered by their order in the search instruction. The text of the What’s New Paragraph is \2 and the text of the Phone Number is \5. They’re the second and fifth sub-patterns in the search.
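The replace step can be sketched in Python as well. The style names below (Company Name, Whats New, Value, URL, Phone) are hypothetical stand-ins, not the names used in the magazine, and keeping the fixed labels in the output is an assumption. "\n" again stands in for \r.

```python
import re

SEARCH = re.compile(r"(.+)\nWHAT’S NEW:(.+)\nTHE VALUE:(.+)\n(.+)\n(.+)")

# Hypothetical style names; use the paragraph styles from your own document.
# "u" and "&" are the icon characters placed before the URL and phone number.
REPLACE = (
    r"<pstyle:Company Name>\1\n"
    r"<pstyle:Whats New>WHAT’S NEW:\2\n"
    r"<pstyle:Value>THE VALUE:\3\n"
    r"<pstyle:URL>u\4\n"
    r"<pstyle:Phone>&\5"
)

sample = (
    "Acme Widgets\n"
    "WHAT’S NEW: A smaller widget.\n"
    "THE VALUE: Fits in a pocket.\n"
    "www.example.com\n"
    "555-0100"
)

tagged = SEARCH.sub(REPLACE, sample)
print(tagged)
```

The variable text comes back unchanged through \1 to \5; only the tags and icon characters around it are new.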
By putting in these backslash-number combinations, you’re telling BBEdit to replace the original sub-pattern with itself. So every word of text that follows “What’s New:” until BBEdit finds a return will be replaced by…ITSELF. It remains exactly the same. Same with the other sub-patterns. The Company Name is replaced with itself, but now it has a tag around it. Similarly, the URL is replaced by itself, but now it is preceded by both the tag for its InDesign paragraph style, and the lower case “u” that will appear in a dingbat font in InDesign, thanks to nested style sheets.
In my magazine, I have pages and pages of these product write-ups, so I make sure our editors put two returns between each one when they’re writing in Word, so that I can have BBEdit search for the pattern in every write-up. It’s as simple as searching for the same pattern shown above, but with two returns — \r\r — added at the end, like so:
(.+)\rWHAT’S NEW:(.+)\rTHE VALUE:(.+)\r(.+)\r(.+)\r\r
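A quick way to check that the two-return version finds every write-up is to count the matches. The Python sketch below uses two invented write-ups, with "\n" standing in for \r.

```python
import re

SEARCH = re.compile(r"(.+)\nWHAT’S NEW:(.+)\nTHE VALUE:(.+)\n(.+)\n(.+)\n\n")

# Template for an invented write-up, ending in two line breaks.
write_up = (
    "{name}\n"
    "WHAT’S NEW: Something new.\n"
    "THE VALUE: Something useful.\n"
    "www.example.com\n"
    "555-0100\n\n"
)
document = write_up.format(name="Acme Widgets") + write_up.format(name="Bolt Brothers")

companies = [m.group(1) for m in SEARCH.finditer(document)]
print(companies)  # ['Acme Widgets', 'Bolt Brothers']

# A final write-up that lacks its two trailing line breaks is not matched.
trimmed = document.rstrip("\n")
print(len(SEARCH.findall(trimmed)))  # 1
```

The second check is the practical catch: the last write-up in the file needs its two returns as well, or it is skipped.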
Likewise, my replace pattern would also include those two returns. It would look like this:
The only thing left is to add one little bit of information to the very first line of this text file:
Makes you want to go out and start finding patterns in all of your text, doesn’t it?
January 28th, 2006 at 11:21 am
A quick additional note on this topic, and to my mentioning in Episode 8 that I didn’t know of a Windows equivalent text editor that does pattern searching: I got an e-mail from a listener about an application called UltraEdit that does regular expression searches. I haven’t used it myself, but if you want more info on the product, you can read about it on the manufacturer’s web site at http://www.ultraedit.com. I’d like to thank Adam for that info.
February 17th, 2006 at 12:47 pm
Thanks for the info.
Can you also use cstyle (character style or whatever the code is) in this working method?
February 17th, 2006 at 12:52 pm
Hello,
A question: can you also integrate character styles while placing text?
February 18th, 2006 at 10:14 am
Yes, you can also integrate character styles into Grep Pattern searches. I didn’t mention it in the original post, because applying character styles manually doesn’t usually follow a specific pattern. If you have character styles that always fall in a specific place (for example, at the beginning of a paragraph or after the first sentence), it’s better to nest those character styles inside of the paragraph style by using the Nested Style Sheet settings (see Episode 11 which explains how Nested Style Sheets work). That way, you don’t even have to reference the character styles. They’ll be “built in” to your paragraph style.
If you DO have a specific reason for applying a character style into your source text, the syntax is as follows:
<cstyle:Your Style Name>your text here<cstyle:>
February 22nd, 2006 at 4:42 pm
Thank you for the info and sorry for posting the same question twice. My mistake.
March 6th, 2006 at 5:57 am
I’ve read your remarks about character styles and nested styles. But I was thinking about applying styles to text within one paragraph without a fixed order. For instance, some text within a paragraph has to be bold or in another font, some other text has to have a color, etc., but these changes don’t come in the same order each time. I can ask my contact to place a certain code before and after these ‘text modifications’ so that I can change all of this with pattern searching.
Or maybe this is not the right tool for this???!!!
March 7th, 2006 at 6:27 pm
Without seeing a sample of exactly what you’re working with, it’s hard for me to give a concrete answer to this question. I’ve sent an e-mail to you with some specifics that might help. But the best way for me to answer accurately is to see a sample of how your text needs to be formatted.
July 2nd, 2006 at 5:18 pm
Thank you for this explanation. Two years ago I tried to get this automated pattern thing going. I gave up for lack of information. I tried to export ‘tagged text’ from InDesign and go with that. It didn’t work. Today I learned why.
As you explained to JanDeMan to use your text here i used that to go with. Didn’t work… Looking over the text i noticed something. All sharp brackets were escaped. your text here looked like \your text here\
My text editor tried to be smart. As soon as I told it not to touch that, it worked. I hadn’t noticed before. As I said, I worked with an export out of InDesign, and my editor changed the tags as soon as I edited the text. I didn’t know better or it should be /.
I finally can use my scriptable text-editor with grep-pattern to change text between curved brackets into italic, capitalized abbreviations to small-caps, certain names to bold. Again thanks in a bundle!
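As an illustration of the kind of replacements described here, the Python sketch below wraps parenthesized text and runs of capitals in character-style tags. The style names Italic and SmallCaps are hypothetical, and the patterns are simplified (no nested parentheses).

```python
import re

def tag_inline(text):
    """Wrap simple inline patterns in InDesign-style cstyle tags."""
    # Text between parentheses gets a hypothetical "Italic" character style.
    text = re.sub(r"\(([^()]+)\)", r"(<cstyle:Italic>\1<cstyle:>)", text)
    # Words of two or more capitals get a hypothetical "SmallCaps" style.
    text = re.sub(r"\b([A-Z]{2,})\b", r"<cstyle:SmallCaps>\1<cstyle:>", text)
    return text

print(tag_inline("See the FAQ (second edition) for details."))
# See the <cstyle:SmallCaps>FAQ<cstyle:> (<cstyle:Italic>second edition<cstyle:>) for details.
```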
July 2nd, 2006 at 5:20 pm
Humm, all tags were removed in last post. I’m afraid you keep puzzled about what I tried to explain…
July 2nd, 2006 at 9:55 pm
I couldn’t quite tell if you were saying that the special characters in my post weren’t coming through properly, or if those in your post didn’t come through. Let me know if this isn’t working for you and I’ll try to clarify.
July 3rd, 2006 at 2:13 pm
Okay, I should have written ‘Humm, all tags were removed in MY last post. I’m afraid you are still puzzled about what I tried to explain…’
Those angle brackets (like the ones around pstyle: in your post) didn’t come through in my post. Maybe they also don’t come through in regular mail, because they are also used in HTML and can be dangerous if used in posts. So I have to try to explain it without showing it, in a language that is not my native one.
(English is, as I said, not my native language, and therefore I did not understand the manual about tagging text. That’s why I started off with tagged text exported by InDesign itself.) What happened to me: I used an editor which assumed the tagged text exported by InDesign was HTML or XML. Accordingly, it immediately ‘corrected all the errors’, resulting in faulty tagged text. I used that faulty text to do some experimenting with pattern search and putting formatting on it. As I started off faulty, my hard work never paid off.
Thanks to your explanation I suddenly (after two years of on-and-off trying) saw what was going wrong. And now, by a little hair out of an elephant’s tail (literal translation of a Dutch saying), I have finished making a script that I expect will save me about half the time this specific job takes (putting formatting on specific patterns of characters). Just by dropping a bunch of text files on that script, they receive about 95% of their formatting.
The text editor I use is Tex-Edit Plus. It treats grep a tiny bit differently. Instead of \1 \2 \3 it uses ^1 ^2 ^3 to get hold of subexpressions. If someone is interested, I can put my (Apple)script online.
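Going only by the description in this comment, a replace instruction written with backslash-number references could be converted to the caret form mechanically. The helper below is a hypothetical Python sketch, not something taken from either editor's documentation.

```python
import re

def to_caret_backrefs(replacement):
    """Turn backslash-number references (\\1) into caret-number ones (^1)."""
    return re.sub(r"\\([1-9])", r"^\1", replacement)

print(to_caret_backrefs(r"<pstyle:URL>u\4"))  # <pstyle:URL>u^4
```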
July 4th, 2006 at 10:58 am
I’m glad to hear that this method has cut your work time in half. I’ve put grep to work everywhere I can for my job, and it has reduced hours of work to mere minutes.
August 5th, 2006 at 6:09 pm
At first it seemed to be going like a rocket. I must be doing something wrong, though. Using grep to make tagged text seems to work. When I import it into InDesign, all ‘cstyle’ elements do as expected, but ‘pstyle’ seems to get lost.
The imported text got no paragraph style applied to it whatsoever, not even ‘Basic Paragraph’. Below is a part of the tagged text.
[code]
INGEZONDEN MEDEDELING, van onze correspondent facilitaire zaken
(Heer Bommel en de Hopsa’s, BV 147, 8079)
[/code]
Needless to say, the InDesign document has a paragraph style called ‘Brood (artikel tekst)’ and the import generates no error.
The reason it took me quite a while to realize things were not going that smoothly: when importing the tagged text, the text got the character style applied which was selected in the palette at the time of import. More or less by accident, that was my master character style for the paragraph style I imported.
I noticed hardly any changes to the text when tweaking the settings of ‘Brood (artikel tekst)’. Just moments ago, I realized it was not ‘hardly any changes’ but no changes at all…
Ed.
August 5th, 2006 at 6:15 pm
Humm… Your blog does not allow me to post the InDesign tags. Probably because the comment system does not use BBCode (if it did, all text between [code] and [/code] would show literally).
So how can I (we, the users) show a sample in a comment post?
Ed.
August 6th, 2006 at 1:15 pm
Ed —
E-mail me the text file, and I’ll see if anything’s missing that might be causing your problem. The thing about showing code on the site is to “escape out” the special characters. For example: to display an opening angle bracket, you need to type ampersand-l-t-semicolon (the l and t stand for “less than”, which makes it easier to remember). To display a closing angle bracket, type ampersand-g-t-semicolon (greater than).
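If you have a lot of tagged text to paste into a comment, the escaping can be done by a script instead of by hand. Here is a small Python sketch using the standard library's html module; the tagged line is an invented example.

```python
import html

tagged = "<pstyle:Company Name>Acme Widgets"

# Escape <, > and & so the tags display literally on a web page.
escaped = html.escape(tagged, quote=False)
print(escaped)  # &lt;pstyle:Company Name&gt;Acme Widgets

# Unescaping recovers the original tagged text.
print(html.unescape(escaped) == tagged)  # True
```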
August 6th, 2006 at 3:05 pm
[Babble on]
I did not realize escaping < would work. In an earlier post I tried to color-code my text with HTML, but that failed to work. That’s why I tried BBCode. The missing code was sent to info at your URL.
[Babble off]
Let’s try again to post the tagged text (fingers crossed). It should appear between the two [code] markers.
[code]
<ASCII-MAC>
<pstyle:Brood \(artikel tekst\)>INGEZONDEN MEDEDELING, van onze correspondent facilitaire zaken<pstyle:>
<pstyle:Brood \(artikel tekst\)>(<cstyle:TussenHaakjes>Heer Bommel en de Hopsa’s, <cstyle:SmallCaps>BV<cstyle:> 147, 8079<cstyle:>)<pstyle:>
[/code]
[code]
<ASCII-MAC>
<pstyle:Brood \(artikel tekst\)>INGEZONDEN MEDEDELING, van onze correspondent facilitaire zaken<pstyle:>
<pstyle:Brood \(artikel tekst\)>(<cstyle:TussenHaakjes>Heer Bommel en de Hopsa’s, <cstyle:SmallCaps>BV<cstyle:> 147, 8079<cstyle:>)<pstyle:>
[/code]
Ed.
August 6th, 2006 at 3:08 pm
Whoops, Must have pressed paste button twice. Sorry for wasting the environmental friendly, but expensive recycled electrons used in this blog…
Ed.
August 6th, 2006 at 3:17 pm
Ed —
I think what’s breaking your tagging once it’s brought into InDesign is the backslashes before the parentheses around “artikel tekst”. Actually, I’m surprised you don’t get an error message when placing the text. My experience is that InDesign displays a warning that text can’t be imported when styles identified with tags do not correspond exactly to styles in the document. If your paragraph style is named “Brood (artikel tekst)”, your incoming text should not have the backslashes in it.
August 6th, 2006 at 3:57 pm
Well Michael, those backslashes before the parentheses were put there by InDesign itself. My paragraph style is indeed named “Brood (artikel tekst)”. On exporting it, those (also escape?*) backslashes appeared.
But I tried as you mentioned without those backslashes, using the exact spelling of the paragraph style in the application. The result is the same. The paragraph style seems not to get imported.
A thought sprang up. What if those backslashes were not put there intentionally by InDesign’s export module, but are a misuse of functions like StripSlashes and AddSlashes in the code… A bug, or undocumented feature, so to speak…
I’ll try renaming all my styles so they don’t use parentheses. You will hear the result. (Should I also not use spaces, just to be safe?)
*
A backslash is used as an escape character in some programming languages. It works more or less like the & for HTML mentioned earlier in this thread.
Ed.
August 6th, 2006 at 4:03 pm
Spaces are fine in style names. Parentheses can be dealt with, but if they’re causing trouble, try removing them and see if it works.
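For experimenting with both spellings of a style name, a pair of small helpers can add or strip the backslashes seen in the exported text above. This Python sketch only mirrors that export; whether InDesign needs the backslashes on import is not settled in this thread, and the function names are hypothetical.

```python
import re

def escape_style_name(name):
    """Put a backslash before each parenthesis, as in the exported tags above."""
    return re.sub(r"([()])", r"\\\1", name)

def unescape_style_name(name):
    """Remove a backslash that directly precedes a parenthesis."""
    return re.sub(r"\\([()])", r"\1", name)

escaped = escape_style_name("Brood (artikel tekst)")
print(escaped)                       # Brood \(artikel tekst\)
print(unescape_style_name(escaped))  # Brood (artikel tekst)
```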