text crunching help

fbp_

captain obvious
ok, my company just sent out a big email to all the customers in our database. unfortunately that database was full of misspelled or expired email addresses... we got 10,000 undeliverables and need to cull them from the system while still retaining their company info. oh, and its all in lotusnotes...

i basically need to input a gigantic text file and strip out all the email addresses, putting them into another file

im semi competent with unix and think this is probably accomplished through some arcane combo of grep & cat, but im stuck here on my crappy win98 workstation so i cant work it out in my head...

if anyone could help out it would be much appreciated. if this works i might be able to talk them into upgrading the graphic design department's fleet of powermacs to osx!

any help is appreciated
 
grep and awk are the two commands you want, I think.

I don't know what your file looks like, but grep will return any line matching (or not matching) a search term.

*************Example file*********
12/4/01 Joe Schmoe <jschmoe@nowhere.com> unknown user denied
12/4/01 Billy Bob Bares <bbb@herethere.com> unknown user denied
12/4/01 George Bush <gw@whitehouse.gov> delivered ok
12/4/01 Heas Asmart <HASMART@university.edu> unknown user denied
12/4/01 Anonymous <secretuser007@secretplace.gov> unknown user denied
**************End Example file********

Typing the following:
grep -i unknown filename #returns any line containing the word unknown
#upper or lower case doesn't matter due to
#the -i
grep -i '\.com' filename #returns any line with .com in it (the backslash
#stops the . from matching any character)
grep -v "delivered ok" filename #anything NOT delivered ok is returned



awk returns a part of a line. It's very powerful- many of the people on this list can do more with it- but here's the simple layout...

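grep -i "unknown user" filename | awk -F'[<>]' '{print $2}'
The default delimiter is any whitespace, but the names in the example file have different numbers of words, so the address doesn't always land in the same whitespace-separated field. Setting the delimiter with -F to < or > makes the address field 2 on every line, so the above will return just the email address of anyone that has gotten unknown user. Notice the | between the commands. It's called a pipe and passes the output of one command to the next.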
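
Here's a rough sketch you can paste in to try the -F idea against the example file. It assumes the file is saved under the placeholder name "filename" and that every address sits between < and >, which may not hold for a real dump.

```shell
# Recreate the example file ("filename" is only a placeholder name).
cat > filename <<'EOF'
12/4/01 Joe Schmoe <jschmoe@nowhere.com> unknown user denied
12/4/01 Billy Bob Bares <bbb@herethere.com> unknown user denied
12/4/01 George Bush <gw@whitehouse.gov> delivered ok
12/4/01 Heas Asmart <HASMART@university.edu> unknown user denied
12/4/01 Anonymous <secretuser007@secretplace.gov> unknown user denied
EOF

# Keep only the bounced lines, then split each one on < or >.
# The address is field 2 no matter how many words the name has.
grep -i "unknown user" filename | awk -F'[<>]' '{print $2}'
# prints:
#   jschmoe@nowhere.com
#   bbb@herethere.com
#   HASMART@university.edu
#   secretuser007@secretplace.gov
```

The gw@whitehouse.gov line is dropped by grep because it was delivered ok.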

The last thing to tell you is the > which takes the output and puts it somewhere else.

grep -i "unknown user" filename | awk -F'[<>]' '{print $2}' > newfile
The above will put the output into a new file called newfile. Use two >> to append to a file, and a single > to overwrite whatever is already there (if anything).
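
If the difference between > and >> isn't obvious, a quick throwaway test like this one shows it. The file name "newfile" is just a scratch name for the demo.

```shell
echo "first"  > newfile    # > creates newfile, or empties it if it already exists
echo "second" >> newfile   # >> adds a line to the end
cat newfile                # shows first, then second

echo "third"  > newfile    # a single > again, so the old contents are gone
cat newfile                # shows only third
```

So when you're collecting addresses from several runs into one file, >> is probably the one you want.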

The cat command (short for concatenate) joins two or more files together.

cat file1.txt file2.txt > file3.txt
takes the contents of file1.txt, then file2.txt, and makes a new file called file3.txt.


I hope this will get you started. It'll take a lot of playing to get what you need, but that's the fun part! :)

Good luck
-todd-
 
thanks for the help!

i think i might have worked out a partial solution
grep -E '[^[:space:]]+@[^[:space:]]+' input.txt > output.txt

i dont get to actually begin fiddling until i get home though...

it looks like i need to use awk though because the raw text is the headers & bodies of 10,000 emails dumped from lotusnotes, and all i want are the addresses
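
For pulling only the addresses out of raw headers and bodies, one possible approach is grep's -o option, which prints just the matched text instead of the whole line. This is a sketch, not a tested recipe for a real Lotus Notes dump: the sample text below is made up, the pattern is a loose approximation of an email address, and -o is a GNU/BSD grep extension, so it may be missing on older systems.

```shell
# Made-up stand-in for a small piece of the dump.
cat > input.txt <<'EOF'
From: Mail Router <postmaster@example.com>
To: sales@example.org
Subject: DELIVERY FAILURE: User jschmoe@nowhere.com not listed

Your message to jschmoe@nowhere.com could not be delivered.
EOF

# -o prints only the matching part, one match per line.
# -E enables extended regexes, so + and {2,} need no backslashes.
# sort -u sorts the matches and drops the duplicates.
grep -oE '[[:alnum:]._%+-]+@[[:alnum:].-]+\.[[:alpha:]]{2,}' input.txt | sort -u > output.txt
cat output.txt
```

Note that this grabs every address it sees, including the sender and the mail router, so the list would still need filtering (for example with grep -v) before anything gets culled.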

ill be making a hopefully victorious post around 6:30est
 
it keeps thinking the input file is binary...

its just plain text though

could it having been zipped (40mb interoffice emails are rough) be responsible?
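
One guess, and it is only a guess without seeing the file: grep usually reports a file as binary when it finds NUL bytes in it, which can happen with exported mail. If the file was unzipped properly, the zipping itself shouldn't matter. A sketch of two possible workarounds, using a made-up file called dump.txt (the -a option is a GNU/BSD grep extension):

```shell
# Build a small test file with a stray NUL byte in it.
printf 'To: jschmoe@nowhere.com\n\000\nTo: bbb@herethere.com\n' > dump.txt

# Option 1: -a tells grep to treat the file as text anyway.
grep -a '@' dump.txt

# Option 2: strip the NUL bytes into a clean copy, then grep that.
tr -d '\000' < dump.txt > clean.txt
grep '@' clean.txt
```

Both options should print the two To: lines.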
 