Help with text file processing

aquamacus

Okay, I have this text file that I was given and need some help processing it. It's a list of names and addresses that I need to import into a database using a CSV file or something similar.

Here is what the data looks like:

Janice Worthen
Last Name: Blow
First Name: Joe
Company: Company

Business: (XXX) XXX-XXXX

E-mail: somebody@something.com

That is how it is listed in this text file...if you need more info let me know...

There are 5000 names and I don't want to do each one individually... anyone know an easy way to make this work? I have been using Excel to try to make it into what I want it to be, but I'm not finding a good way to do so.
 
aquamacus said:
Okay, I have this text file that I was given and need some help processing it. It's a list of names and addresses that I need to import into a database using a CSV file or something similar.

There are 5000 names and I don't want to do each one individually... anyone know an easy way to make this work?

So we can assume that there is some non-headered value for the name, then a headered value "E-mail:" as the last line of a record. Is that consistent among all the records? Do all the records have the same number of lines, same order, same "whitespace" lines between them, etc.?

For instance, there's not a fax number in some, or multiple email addresses, etc.?

Consistency is the key in text file processing. If it's consistent, it turns out to be pretty easy to do what you want. If it's not, then we have to rethink some ideas. I could do up a Python script or some C# or Java code to take care of that right quick. I'm sure there are several guys here who know Perl better than I do and could pull some nice tricks out to do it as well.

In short: open the file, begin reading, dump lines into a bucket object, and keep reading until you hit the E-mail header, then reset the bucket. The bucket should then be given enough intelligence to parse up the several lines it has and dump them out to a CSV or SQL file.
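
Something like this, as a rough C# sketch (the file name is a placeholder, and it assumes the E-mail line really is the last line of every record):

// Rough sketch of the "bucket" approach: accumulate lines until the
// E-mail header, then flush the bucket as one record.
// "contacts.txt" is a placeholder for the real input file.
using System;
using System.Collections.Generic;
using System.IO;

class BucketSketch
{
    static void Main()
    {
        var bucket = new List<string>();         // lines for the current record
        var records = new List<List<string>>();  // all completed records

        foreach (string raw in File.ReadLines("contacts.txt"))
        {
            string line = raw.Trim();
            if (line.Length == 0) continue;      // skip the blank spacer lines

            bucket.Add(line);

            // The E-mail header marks the end of a record, so flush the bucket.
            if (line.StartsWith("E-mail:", StringComparison.OrdinalIgnoreCase))
            {
                records.Add(bucket);
                bucket = new List<string>();
            }
        }

        Console.WriteLine("Parsed {0} records", records.Count);
    }
}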
 
The only real consistency in the file is that the E-mail line is the last header in each record. They also all start out with a non-header item that is the full name. Some of them contain fax numbers, the person's title, business name, and business address. If they all had the same lines and information I could do this on my own, but I don't know Python (or Perl) well enough to deal with the inconsistencies. It does start and end the same way, though... that's about it...

I could send the file if you would like me to...
 
Consistency in these types of projects entails "cleaning" the data so it all looks the same. Then you can apply import rules and all fields will contain the correct information. I know, not much help… :)
 
Exactly. This would take either plenty of manual time or some *good* work with regular expressions to pre-clean the data. I'd be glad to take a crack at it if you want to send me the file.

chris at chornbe dot com

If you're worried about privacy and would rather not, I understand. I'll tinker with some regexes in the meantime and see if I can come up with a set of rules. Might be fun :)
 
No problem. I looked through the data and (grumble, grumble) some of the phone numbers are in the format (xxx) xxx-xxxx and some are in xxx-xxx-xxxx. Some have one, some have two, some have none.

Well, it'll be fun :)
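
For what it's worth, a regex along these lines could fold both phone formats into one before the import (just a sketch; the target format is a guess at what the database wants):

// Normalizes 555-123-4567 and (555) 123-4567 style numbers to (xxx) xxx-xxxx.
using System;
using System.Text.RegularExpressions;

class PhoneSketch
{
    static readonly Regex PhonePattern =
        new Regex(@"\(?(\d{3})\)?[\s.-]?(\d{3})[\s.-](\d{4})");

    static string NormalizePhones(string line)
    {
        return PhonePattern.Replace(line, "($1) $2-$3");
    }

    static void Main()
    {
        Console.WriteLine(NormalizePhones("Business: 555-123-4567"));    // (555) 123-4567
        Console.WriteLine(NormalizePhones("Business: (555) 123-4567"));  // unchanged
    }
}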
 
See, that is my problem... I get some of it to work, but then it dies once it reaches that kind of crap. I didn't export or create this file; it was just given to me. They said "here it is, make it work"... I wish I was better with Python or Perl... but I'm not...

Thanks again...
 
Aquamacus,

Ok, I've got a preliminary data scrubber application finished for you. Like I said, it was fun. Here's what I did... (I'm also emailing you the output file and the source code).

I open the file in a standard line-by-line text reader.

Each time I see a tab at index 0 followed by text, I create a new container object.

I dump each line of text into the container, and loop that. Each new container (after being populated) gets dumped into a hash for holding. I suppose for larger data sets a serialization method and a temp file could be used.

Then I close the input file - done with it.
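
In code, the reading phase might look something like this (a sketch only; "contacts.txt" and the list-of-lines container are stand-ins for the real source):

// Sketch of the reading phase: a tab at index 0 followed by text starts
// a new container; every non-blank line goes into the current container.
using System;
using System.Collections.Generic;
using System.IO;

class ReaderSketch
{
    static void Main()
    {
        var containers = new List<List<string>>();
        List<string> current = null;

        foreach (string line in File.ReadLines("contacts.txt"))
        {
            if (line.StartsWith("\t") && line.Trim().Length > 0)
            {
                current = new List<string>();
                containers.Add(current);
            }

            if (current != null && line.Trim().Length > 0)
                current.Add(line.Trim());
        }

        Console.WriteLine("Read {0} containers", containers.Count);
    }
}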

Now that I've got all the "records" separated, I loop each container and ask it for the names of all the fields it knows about. I've defined a field name as "any text starting at the beginning of a line, up until the colon" (e.g., "Fax Number: 555-555-1212"). The field value then becomes the value after the colon. This returns a string array of just the field names.

The calling class takes that array of field names and adds each one to a hashtable of field names, dropping dupes.

At the end of looping the containers, I then have a comprehensive list of all available field names in the entire data set.

Then, I simply loop the containers again, asking for each container's value for field name x. The container class has a very generic GetValue(string key){} function that either returns "" or the value found to the right of the colon on a line starting with the requested key.

Kind of like:

string[] flds = GetAllFields(list_of_containers);
foreach (Container c in list_of_containers) {
    foreach (string fld in flds) {
        // write c.GetValue(fld) out as the next CSV value
    }
}

It's very much debug/dev-time quality code, but it works. It could also be optimized by losing one of the loops and tightening the code up. But it's a good place to start. I did it in C# (yes, on my Mac via Mono), but it could have been done in any other language with similar constructs.
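
For reference, a stripped-down version of the container idea might look something like this (the names here are illustrative, not the actual source that was emailed):

// Field name = text before the first colon, field value = text after it.
using System;
using System.Collections.Generic;

class Container
{
    // Raw lines for one record, e.g. "Fax Number: 555-555-1212"
    private readonly List<string> lines = new List<string>();

    public void AddLine(string line) { lines.Add(line); }

    // All field names this record knows about (text up to the first colon).
    public IEnumerable<string> GetFieldNames()
    {
        foreach (string line in lines)
        {
            int colon = line.IndexOf(':');
            if (colon > 0)
                yield return line.Substring(0, colon).Trim();
        }
    }

    // The value to the right of the colon for the requested key, or "".
    public string GetValue(string key)
    {
        foreach (string line in lines)
        {
            int colon = line.IndexOf(':');
            if (colon > 0 && line.Substring(0, colon).Trim() == key)
                return line.Substring(colon + 1).Trim();
        }
        return "";
    }
}

class ContainerDemo
{
    static void Main()
    {
        var c = new Container();
        c.AddLine("Last Name: Blow");
        c.AddLine("E-mail: somebody@something.com");

        Console.WriteLine(c.GetValue("E-mail"));  // somebody@something.com
        Console.WriteLine(c.GetValue("Fax"));     // prints an empty line
    }
}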

I didn't break apart the address records, so each address in the .CSV file will look like "124 some street anytown, PA 99999". You'll have to come up with an algorithm to break them up if you need to do so.
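
If it comes to that, a regex can at least peel the state and ZIP off the end; street versus city can't be separated reliably without a delimiter, so this sketch leaves the rest alone (names and pattern are illustrative only):

// Splits "124 some street anytown, PA 99999" into rest / state / ZIP.
using System;
using System.Text.RegularExpressions;

class AddressSketch
{
    static readonly Regex TailPattern =
        new Regex(@"^(?<rest>.+?),?\s+(?<state>[A-Z]{2})\s+(?<zip>\d{5}(-\d{4})?)$");

    static void Main()
    {
        Match m = TailPattern.Match("124 some street anytown, PA 99999");
        if (m.Success)
        {
            Console.WriteLine("Street/City: " + m.Groups["rest"].Value);
            Console.WriteLine("State:       " + m.Groups["state"].Value);
            Console.WriteLine("ZIP:         " + m.Groups["zip"].Value);
        }
    }
}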

Hope this helps. It was fun to tinker with.
 