Compare files, extract the differences

wicky · Jan 14, 2009

I did an HTML mail out for a client against a supplied .csv file. The client has now asked me to send a second mail out from an updated .csv file.

The problem is that both files are VERY large & in a completely different order.

I'm looking for the best/ quickest/ easiest way to resolve this because I don't fancy handpicking the differences.

Does anybody know of software or a utility that would do this job, or alternatively a relevant Terminal command-line approach?

Cheers

fryke · Jan 14, 2009

Open with Excel (or similar app), sort the results of both documents and do a merge. Has got to be possible with Excel (or similar app).

Mikuro · Jan 14, 2009

TextWrangler has a function to delete duplicate lines in a text file. Would that work?

fryke · Jan 14, 2009

I guess that would do it, yes. "Process Duplicate Lines..." in the "Text" menu does just that, regardless of whether the lines are in order. So you just append the second file to the first and run that command. It only works well if the lines are _exactly_ the same, of course.

wicky · Jan 14, 2009

It's in csv format, so all appears as one line.

Is there a command in Wrangler's "find & replace" that I can use to separate onto multiple lines (ie. ", " -> "carriage return")?

Thanks

bbloke · Jan 14, 2009

You may already be embarking on a different route for accomplishing your objective, but I thought I'd make a quick note that there is a useful UNIX command for similar tasks. diff lets you compare two files and outputs the differences between them. It's been very useful for me in the past, although you will of course want to convert the file from the csv format first.
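One caveat with diff is that it compares the files line by line, in order, so two files holding the same entries in a different order will look almost entirely different. A minimal sketch of the sort-then-compare idea follows, written in Python with the standard difflib module rather than the diff utility itself. The function name sorted_diff is made up for illustration, and the sketch assumes each file has already been reduced to one address per line.

```python
import difflib
from itertools import islice


def sorted_diff(old_lines, new_lines):
    """Return '-addr' for lines only in old and '+addr' for lines only in new."""
    # Normalise: trim whitespace, ignore case, drop blanks and repeats, then
    # sort so that the original ordering of each file no longer matters.
    old_sorted = sorted({line.strip().lower() for line in old_lines if line.strip()})
    new_sorted = sorted({line.strip().lower() for line in new_lines if line.strip()})

    # n=0 asks for no context lines, so only added/removed lines are reported.
    diff = difflib.unified_diff(old_sorted, new_sorted, n=0, lineterm="")

    # The first two lines are the '---' / '+++' file headers; '@@' lines mark hunks.
    return [line for line in islice(diff, 2, None) if not line.startswith("@@")]
```

Passing the lines of the older file and then the newer one should give back only the entries that were added (prefixed with +) or dropped (prefixed with -).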

Mikuro · Jan 14, 2009

wicky said:
Is there a command in Wrangler's "find & replace" that I can use to separate onto multiple lines (ie. ", " -> "carriage return")?

Yes. In the find/replace dialog, put "\r" in the Replace field.
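For anyone who would rather script this step than use find/replace, here is a minimal sketch in Python. It assumes the csv holds nothing but comma-separated addresses; the function name split_addresses and the sample data are made up for illustration.

```python
import csv
import io


def split_addresses(csv_text):
    """Flatten comma-separated text into a list with one entry per item."""
    # skipinitialspace=True drops the space that follows each comma in ", ".
    reader = csv.reader(io.StringIO(csv_text, newline=""), skipinitialspace=True)
    # Flatten every row and discard empty fields (for example, trailing commas).
    return [field.strip() for row in reader for field in row if field.strip()]


if __name__ == "__main__":
    sample = "a@example.com, b@example.com,c@example.com\n"
    # Join with newlines to get the one-address-per-line form.
    print("\n".join(split_addresses(sample)))
```

Using the csv module rather than a plain split on "," means quoted fields that themselves contain commas are kept in one piece.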

wicky · Jan 14, 2009

I'm feeling a bit dumb here...

I've processed the 2 csv files, so now I have 2 txt files, each with one email address per line. If I add the 2 sets of content together and remove duplicate lines, I will end up with a replica of the second (newer) file. The newer file has exactly the same content as the older file, but with some additions.

What I'm trying to achieve is just finding the differences. Which should amount to about 135 email addresses.

Am I missing something obvious?

wicky · Jan 14, 2009

I tried "diff -ib" in the Terminal, however it output everything.

Is there a way to just get the differences
(ie. the addresses that only appear in one of the files)?

Thanks for your help/ patience/ etc.

Mikuro · Jan 14, 2009

Ah, I misunderstood what you wanted. Nevertheless, I think TextWrangler can do it. In the Process Duplicate Lines dialog, change the top option from "leaving one" to "matching all", then check the "delete duplicate lines" box.
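The same idea can be sketched outside TextWrangler: combine both lists and keep only the lines that occur exactly once. This is a rough Python equivalent of how I read the "matching all" option, not a description of what TextWrangler does internally; the function name and sample addresses are made up for illustration.

```python
from collections import Counter


def lines_appearing_once(lines):
    """Keep only the lines that occur exactly once, in their original order."""
    # Trim whitespace and skip blank lines so stray spaces do not hide a match.
    cleaned = [line.strip() for line in lines if line.strip()]
    counts = Counter(cleaned)
    return [line for line in cleaned if counts[line] == 1]


if __name__ == "__main__":
    old = ["a@example.com", "b@example.com"]
    new = ["b@example.com", "c@example.com", "a@example.com"]
    # Appending the newer list to the older one mirrors pasting both files together.
    print(lines_appearing_once(old + new))
```

Note that an address repeated within a single file would also be dropped by this approach, so it is worth de-duplicating each file on its own first.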

wicky · Jan 14, 2009

Worked a treat.... I think.
At least I've ended up with a handful of email addresses rather than **many**.

Thanks!!
