
Re: Text Manipulation

To: silug-discuss@silug.org
Subject: Re: Text Manipulation
From: mike808 <mike808@users.sourceforge.net>
Date: Wed, 24 Jul 2002 21:08:50 -0500
In-Reply-To: <CEC866502149D311914000204840B32106806222@ustcvex02.transcom.mil>
Organization: Southern Illinois Linux Users Group
References: <CEC866502149D311914000204840B32106806222@ustcvex02.transcom.mil>
Reply-To: silug-discuss@silug.org
Sender: silug-discuss-owner@silug.org
User-Agent: KMail/1.4.1

On Wednesday 24 July 2002 01:55 pm, Spearing Tyler Contractor USTC wrote:
> In this particular instance, the first field will be, say,
> 18 characters long, followed by a space, then three fields,
> say 8 characters long each followed by a space, and then a final field,
> say 5 characters long.  A colleague of mine is attempting
> to evangelize me with the shell commands "cut" and "paste".
> I'm a little concerned with cutting out all the data into
> separate files, and then pasting it all back together
> into one file.  It seems too easy to offset the lines
> somehow, or otherwise screw up the whole thing.

Yes, and every rip through cut is another full read of your entire file, plus
a temporary file to write. Then tack *another* pass on for the paste. This is
not a good thing for a several-hundred-megabyte data file.
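
For reference, the cut-and-paste route for this layout would look roughly like
the sketch below. The function name and temporary file names are made up for
illustration, and the column ranges assume the 18/8/8/8/5 layout with no
separators in the input. Each cut re-reads the whole input, and paste then
reads every temporary file again:

```shell
# Hypothetical helper: split fixed-width records with cut, rejoin with paste.
# Usage: cut_paste_fields INPUT OUTPUT
cut_paste_fields() {
    src=$1
    dst=$2
    tmp=$(mktemp -d) || return 1
    cut -c1-18  "$src" > "$tmp/f1"    # pass 1 over the input
    cut -c19-26 "$src" > "$tmp/f2"    # pass 2
    cut -c27-34 "$src" > "$tmp/f3"    # pass 3
    cut -c35-42 "$src" > "$tmp/f4"    # pass 4
    cut -c43-47 "$src" > "$tmp/f5"    # pass 5
    # paste reads all five temporary files and joins them with single spaces
    paste -d' ' "$tmp/f1" "$tmp/f2" "$tmp/f3" "$tmp/f4" "$tmp/f5" > "$dst"
    rm -rf "$tmp"
}
```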

Since you've shared more information about the exact parsing requirements, try 
this:

perl -pi -e 's/(.{18})(.{8})(.{8})(.{8})(.{5})/$1 $2 $3 $4 $5/' FILENAME

See how easy that was? Just define each "field" in terms of a regex. KISS.
(The -i switch edits FILENAME in place; use -i.bak if you want Perl to keep
a backup copy.) Note that if you have irregular data (i.e. exceptions to the
uniform layout the regex assumes), then you may require more complex
processing. I'd still do it within the regex engine, since that's what it's
designed for. Instead of a one-liner substitution, switch to a pattern match
followed by code that reconstructs the new line with whatever logic you
need: use m// instead of s/// and reference $1, $2, $3, etc. as before, but
in code.

> I'm kind of thinking of a PERL script that will take
> the characters one at a time, and looping the appropriate
> number of times for each field of characters, and inserting
> the spaces between loops.

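REGEXes are your friend. And typically far faster than a character-at-a-time
loop written in straight Perl, since the matching runs inside the regex
engine's compiled code instead of in interpreted Perl ops. Yuk. Not even
Unicode scares the RegEx engine!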

> It just seems like this would be a fairly common
> problem with a published solution, but we have not yet been
> able to find it.

It is. It's just so trivial a solution that it takes longer to find someone
else who has the *exact* same problem as you than to just do it. That's also
why there are no books on "How to sign your own name".

However, the definitive resource on RegEx munging is Jeffrey Friedl's book,
"Mastering Regular Expressions" (O'Reilly). Note: New 2nd edition is out.
Also note that this book is a *very* difficult read, as it's about how to 
write RegEx parsers and how they work in exquisite detail.

An even better book, if you do loads of day-to-day Perl slinging, is "Data
Munging with Perl" by David Cross (e-book (PDF) available for <$20). Trust me,
you'll want the dead-tree version.

Mike808/
-- 
() Join the ASCII ribbon campaign against HTML email and Microsoft-specific
/\ attachments. If I wanted to read HTML, I would have visited your website!
Support open standards.
