Re: Text Manipulation
On Wednesday 24 July 2002 01:55 pm, Spearing Tyler Contractor USTC wrote:
> In this particular instance, the first field will be, say,
> 18 characters long, followed by a space, then three fields,
> say 8 characters long each followed by a space, and then a final field,
> say 5 characters long. A colleague of mine is attempting
> to evangelize me with the shell commands "cut" and "paste".
> I'm a little concerned with cutting out all the data into
> separate files, and then pasting it all back together
> into one file. It seems too easy to offset the lines
> somehow, or otherwise screw up the whole thing.
Yes, every pass through cut reads your entire file and writes one field out
to an intermediate file. Then paste has to read all of those back in to write
the result. This is not a good thing for a several-hundred-megabyte data file.
Since you've shared more information about the exact parsing requirements, try
this:
perl -pi -e 's/(.{18})(.{8})(.{8})(.{8})(.{5})/$1 $2 $3 $4 $5/' FILENAME
See how easy that was? Just define each "field" in terms of a regex.
KISS. Note that -i edits FILENAME in place with no backup; use -pi.bak if you
want to keep the original. Also note that if you have irregular data (i.e.
exceptions to the uniform layout the regex assumes), then you may require
more complex processing. I'd still do it within the regex engine, since that's
what it's designed for. Maybe switch to a pattern match followed by code that
reconstructs the new line with arbitrary logic, instead of a one-liner
substitution. Just use m// instead of s/// and reference $1, $2, $3, etc. as
before, but in code.
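
As a rough sketch of that m// approach (not a drop-in solution): the helper
name respace_record and the sample record below are made up for illustration,
and the field widths are the ones quoted above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): returns the record with
# single spaces between the five fields, or undef if the layout does not fit.
sub respace_record {
    my ($line) = @_;

    # Same field widths as the one-liner: 18, 8, 8, 8 and 5 characters.
    if ($line =~ /^(.{18})(.{8})(.{8})(.{8})(.{5})/) {
        my @fields = ($1, $2, $3, $4, $5);
        # Arbitrary per-field logic (validation, trimming, ...) would go here.
        return join ' ', @fields;
    }
    return;    # irregular line: let the caller decide what to do with it
}

# Made-up sample record: 47 characters, no separators.
my $sample = ('A' x 18) . ('B' x 8) . ('C' x 8) . ('D' x 8) . ('E' x 5);
my $fixed  = respace_record($sample);
print defined $fixed ? "$fixed\n" : "unparsed: $sample\n";
```

Unlike the s/// one-liner, this version returns only the 47 matched
characters, so anything after the last field (including the newline) is
dropped and the caller has to add the line ending back. For a real file you
would call the helper inside a while (<>) loop and send the unparsed lines
somewhere you can inspect them.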
> I'm kind of thinking of a PERL script that will take
> the characters one at a time, and looping the appropriate
> number of times for each field of characters, and inserting
> the spaces between loops.
Regexes are your friend. The regex engine is compiled C, so it is typically
far faster than a character-at-a-time loop written in straight Perl. Yuk. Not
even Unicode scares the regex engine!
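
If you want to check the speed claim on your own machine rather than take it
on faith, here is a minimal sketch using the core Benchmark module. The two
sub names and the sample record are made up for illustration, and by_loop is
only my reading of the loop described in the quoted text.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my @widths = (18, 8, 8, 8, 5);

# Character-at-a-time version, roughly as described in the quoted text.
sub by_loop {
    my ($line) = @_;
    my ($out, $pos) = ('', 0);
    for my $i (0 .. $#widths) {
        for (1 .. $widths[$i]) {
            $out .= substr($line, $pos++, 1);    # copy one character
        }
        $out .= ' ' if $i < $#widths;            # space between fields only
    }
    return $out;
}

# Regex version: the same substitution as the one-liner.
sub by_regex {
    my ($line) = @_;
    $line =~ s/(.{18})(.{8})(.{8})(.{8})(.{5})/$1 $2 $3 $4 $5/;
    return $line;
}

# Made-up sample record: 47 characters, no separators, no newline.
my $record = ('A' x 18) . ('B' x 8) . ('C' x 8) . ('D' x 8) . ('E' x 5);

# Run each version for at least one CPU second and print a comparison table.
cmpthese(-1, {
    loop  => sub { by_loop($record) },
    regex => sub { by_regex($record) },
});
```

The numbers will vary with your Perl build and hardware, so treat the table
as a sanity check on your own data rather than a fixed ratio.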
> It just seems like this would be a fairly common
> problem with a published solution, but we have not yet been
> able to find it.
It is. It's just so trivial a solution that it takes longer to find someone
else who has the *exact* same problem as you than to just do it. That's also
why there are no books on "How to sign your own name".
However the definitive resource on RegEx munging is Jeffrey Friedl's book,
"Mastering Regular Expressions" (O'Reilly). Note: New 2nd edition is out.
Also note that this book is a *very* difficult read, as it covers how regex
engines work, and how to write expressions for them, in exquisite detail.
An even better book if you do loads of day-to-day Perl slinging is "Data
Munging with Perl" by David Cross (e-book (PDF) available for <$20). Trust me,
you'll want the dead-tree version.
Mike808/
--
() Join the ASCII ribbon campaign against HTML email and Microsoft-specific
/\ attachments. If I wanted to read HTML, I would have visited your website!
Support open standards.