image
image

Go Back   macosx.com > Design, Media, Programming & Scripting > Software Programming & Web Scripting

Reply
 
LinkBack Thread Tools
  #1  
Old March 23rd, 2008, 09:27 PM
Mikuro's Avatar
Crotchety UI Nitpicker
 
Join Date: Mar 2005
Posts: 2,696
Thanks: 6
Thanked 55 Times in 50 Posts
Mikuro will become famous soon enough
Finding unique files within Backups.backupsdb

I'm writing a program that processes data in Time Machine's Backups.backupsdb folder. The important thing is that there's no reason for me to process the same file twice, and Time Machine stores multiple links to the same files — one in every single incremental backup.

I can test whether a file has already been processed by recording the inode numbers of every file and comparing new files against the list (I use an NSMutableSet for this task). The problem is that even just going through the directories, without doing any significant processing of the duplicate links, takes an obscene amount of time.

I estimate that there are about 20 million files (and 3.5 million folders) in my backupsdb folder, but only about 500,000 unique files. Going through all those duplicates is just not reasonable. It would literally take all day to process the folder.

Is there any FAST way to ignore duplicate links? Maybe by accessing files directly by inode number instead of path? Is there any way to get a list of all paths pointing to a given inode, or of all inodes within a given directory?

I can't think of any way to get this done. Any ideas?
__________________
Mac mini — 1.25GHz G4, 1GB RAM — OS 10.5.8

Useful programs: Privoxy, Butler, ffmpegX, VLC, Perian, Tofu, Wcalc
Reply With Quote
  #2  
Old March 24th, 2008, 03:18 PM
ElDiabloConCaca's Avatar
Registered User
 
Join Date: Aug 2001
Location: San Antonio, Texas
Posts: 12,993
Thanks: 7
Thanked 443 Times in 424 Posts
ElDiabloConCaca is a glorious beacon of lightElDiabloConCaca is a glorious beacon of lightElDiabloConCaca is a glorious beacon of lightElDiabloConCaca is a glorious beacon of lightElDiabloConCaca is a glorious beacon of lightElDiabloConCaca is a glorious beacon of light
Each unique file on the backup is a file once and only once, and every other time in the backup, appears as a link.

So, instead of processing all files, simply process real files, and ignore links altogether (done with a simple UNIX 'file' test). Go through the backup reverse-chronologically, and you're guaranteed to always process the newest version of a file first.

Edit: Hmmm... I just thought about my algorithm, and it may be severely flawed... tee hee... if you only process all real files and skip links, then you'll end up having to touch all files in the backup... not very efficient.

With the UNIX 'find' command, you can use the -f switch to ignore symbolic links. You could run the UNIX 'find' command once on your backup drive at the beginning of the program, printing out all paths to all real files. You could then parse this list of files and do your processing.
__________________
Mac mini 2.0GHz 10.6.2 • 4GB • 320GB • Superdrive • 4 x 1TB USB 2.0 • LED Cinema Display
MacBook 2.0GHz Core 2 Duo - White 10.6.2 • 4GB • 250GB • CD-RW/DVD-ROM
iPhone 3G 8GB • iPod Touch 8GB • iPod Photo 60GB • iPod nano 1GB • AT&T U-Verse 12Mb/1.5Mb
http://www.jeffhoppe.com

Last edited by ElDiabloConCaca; March 24th, 2008 at 03:23 PM.
Reply With Quote
  #3  
Old March 25th, 2008, 04:00 AM
Mikuro's Avatar
Crotchety UI Nitpicker
 
Join Date: Mar 2005
Posts: 2,696
Thanks: 6
Thanked 55 Times in 50 Posts
Mikuro will become famous soon enough
Thanks for the ideas. I actually do want to process all files, including old versions, so that's not a problem. That's the whole point, really — I'm writing a program to make a number of Burn Folders out of larger folders. It's just that dealing with Time Machine's data is proving trickier than I thought. I want to be able to drop my Backups.backupdb on it, and then burn all the (unique) data to DVDs for archiving. My entire backupsdb should fit on just a handful of DVDs if I can properly deconstruct it. Then from there on I could easily burn just the new entries in Time Machine to new DVDs.

Time Machine uses hard links, though, not symbolic links, so I'm not sure that find trick will work. Or will it? I can't find it documented, so I don't know the details. The man page for find lists a -f flag, but it's just to specify the directory to search. From the man page: "-f Specify a file hierarchy for find to traverse. File hierarchies may also be specified as the operands immediately following the options." Am I looking in the wrong place?

That gives me another idea, though. If I used find to return only items modified after the date of the previous backup entry, then that should (in theory) return only new files. Whether it would be significantly faster than my current method, I'm not sure. I assume 'find' traverses the entire directory structure much the same way I do currently, in which case probably not. Spotlight might do a better job, but I'd hate to require Spotlight — especially since I myself exclude my Backups.backupdb folder from Spotlight for performance!

For that matter, maybe the thing that's really slowing me down is Cocoa. Right now I'm using [NSFileManager enumeratorAtPath:], and managing autorelease pools manually. Maybe I ought to use low-level functions instead.
__________________
Mac mini — 1.25GHz G4, 1GB RAM — OS 10.5.8

Useful programs: Privoxy, Butler, ffmpegX, VLC, Perian, Tofu, Wcalc
Reply With Quote
  #4  
Old March 25th, 2008, 09:17 AM
ElDiabloConCaca's Avatar
Registered User
 
Join Date: Aug 2001
Location: San Antonio, Texas
Posts: 12,993
Thanks: 7
Thanked 443 Times in 424 Posts
ElDiabloConCaca is a glorious beacon of lightElDiabloConCaca is a glorious beacon of lightElDiabloConCaca is a glorious beacon of lightElDiabloConCaca is a glorious beacon of lightElDiabloConCaca is a glorious beacon of lightElDiabloConCaca is a glorious beacon of light
Quote:
Originally Posted by Mikuro View Post
That gives me another idea, though. If I used find to return only items modified after the date of the previous backup entry, then that should (in theory) return only new files.
I only see one problem with that -- hard links look just like the file itself to the operating system, and with Time Machine, new hard links to old files are created, so if you process those new hard links, you're processing old files.

I do believe there is a way to find out exactly how many hard links to a certain file there is... I think if you execute 'ls -L <pathtofile>' then that will tell you how many hard links point to that file... perhaps that could help in some way? I would highly recommend reading up on that option, though, before taking my word for it... hehe...
__________________
Mac mini 2.0GHz 10.6.2 • 4GB • 320GB • Superdrive • 4 x 1TB USB 2.0 • LED Cinema Display
MacBook 2.0GHz Core 2 Duo - White 10.6.2 • 4GB • 250GB • CD-RW/DVD-ROM
iPhone 3G 8GB • iPod Touch 8GB • iPod Photo 60GB • iPod nano 1GB • AT&T U-Verse 12Mb/1.5Mb
http://www.jeffhoppe.com
Reply With Quote
  #5  
Old March 25th, 2008, 10:22 AM
Mikuro's Avatar
Crotchety UI Nitpicker
 
Join Date: Mar 2005
Posts: 2,696
Thanks: 6
Thanked 55 Times in 50 Posts
Mikuro will become famous soon enough
Yeah, I can see the number of links using 'stat' (I forget the exact options I need, but I know it's possible), but I can't see the locations of each one as far as I can tell.

I'm a little unclear on how hard links work. I was assuming that they all share the same creation/modification/etc. dates, regardless of when each link was made. So if I take a backup dated March 25, and look for only the items modified after March 24 (I'll need to be more precise than that, but you get the idea), it should work...in theory. I'm not thrilled with the idea, though, because it seems like there's a lot of chance for some weird problem I won't notice/consider, and it will be difficult to verify that the results are 100% accurate 'by hand'.

Also, it looks like the find command isn't all that fast to begin with. I guess I'm asking for the impossible here. Those links are there and it seems I can't pretend they're not — not without spending a lot of CPU time on the pretending, anyway!
__________________
Mac mini — 1.25GHz G4, 1GB RAM — OS 10.5.8

Useful programs: Privoxy, Butler, ffmpegX, VLC, Perian, Tofu, Wcalc
Reply With Quote
  #6  
Old March 25th, 2008, 10:30 AM
ElDiabloConCaca's Avatar
Registered User
 
Join Date: Aug 2001
Location: San Antonio, Texas
Posts: 12,993
Thanks: 7
Thanked 443 Times in 424 Posts
ElDiabloConCaca is a glorious beacon of lightElDiabloConCaca is a glorious beacon of lightElDiabloConCaca is a glorious beacon of lightElDiabloConCaca is a glorious beacon of lightElDiabloConCaca is a glorious beacon of lightElDiabloConCaca is a glorious beacon of light
Well, I think that all hard links will "take on" the creation/modification date of the original file... so, if you have three hard links to one file, ALL the hard links will have identical creation/modification dates.

I think I have another suggestion -- 'ls -i' will give you the inode number for each file. You could then compare the inode numbers to make sure you don't "dupe" a file along the way.

Argh... at this point, I'm kinda grasping for ideas. I do believe that what you're trying to do is inherently "intensive" since you are doing a lot of "looking" at a lot of different files.

If I think of something magical, I'll post back, but I think we're bearing down on the conclusion that there are a number of ways to accomplish this, none of which will be lightning-quick, unfortunately.
__________________
Mac mini 2.0GHz 10.6.2 • 4GB • 320GB • Superdrive • 4 x 1TB USB 2.0 • LED Cinema Display
MacBook 2.0GHz Core 2 Duo - White 10.6.2 • 4GB • 250GB • CD-RW/DVD-ROM
iPhone 3G 8GB • iPod Touch 8GB • iPod Photo 60GB • iPod nano 1GB • AT&T U-Verse 12Mb/1.5Mb
http://www.jeffhoppe.com
Reply With Quote
  #7  
Old March 25th, 2008, 11:51 AM
Captain Code's Avatar
Moderator
 
Join Date: Aug 2001
Location: Ontario, Canada
Posts: 3,119
Thanks: 0
Thanked 7 Times in 1 Post
Captain Code will become famous soon enough
Check out FSEvents. From what I remember you can get changes to the file system between two points from it so that might work. I also wrote some code to process a large number of files looking for only executable binary files and wound up finding some code that reads the first 4 bytes of the file and I can check that value against what it should be for executable code. It worked so much faster doing it that way then launching thousands of NSTasks.

I'm not sure how far back FSEvents would go so that might not work but I remember from WWDC you could query it for changes since your last query & other things like that. It might only work since your last query if the process hasn't been quit and relaunched but I don't really know just guessing.
__________________
MacBook Pro 2.16GHz Core2Duo 3GB RAM, G4 1.4GHz OSX Tiger 1.25GB RAM, Dual 2GHz G5 OSX Tiger 2GB RAM (freakin shweet)
Athlon 64 Windoze XP for school work (programming) 1GB RAM
dferns@macosx.com
Reply With Quote
  #8  
Old March 25th, 2008, 10:36 PM
Mikuro's Avatar
Crotchety UI Nitpicker
 
Join Date: Mar 2005
Posts: 2,696
Thanks: 6
Thanked 55 Times in 50 Posts
Mikuro will become famous soon enough
I hadn't considered FSEvents. I remember hearing about them before Leopard came out. From a quick look through the docs, it looks like it's really meant for real-time monitoring of folders for changes, though. I'll take a closer look later.

I just ran a little test to see what the maximum speed I could expect would be when going through every file. I ran this:
Code:
find ./Backups.backupdb -print >> ~/Desktop/finddump
I let it run for 10 seconds to see how many files it could return. After those 10 seconds, the text file had 47,000 entries in it. Not bad at all. There are roughly 470,000 entries (files + folders) per snapshot, so that makes rough math very easy. So, 100 seconds per snapshot, 50 snapshots (increasing by one per week). That's 80 minutes just to dump the path names of all those items, without any special processing. Since my processing will most likely take longer than the directory traversal itself, I guess I need to lower my standards down to "fast enough to complete overnight".

My program currently goes through an abysmal 1200 entries in 10 seconds, almost 40x slower. Of course, it's doing a fair amount of processing in there (creating a SQLite database for future use, and of course recording inodes), and it would get a lot faster after it finished the first snapshot (which would take about 65 minutes, at that rate). I'll have to test how [NSFileManager enumeratorAtPath:] performs when I don't do any special processing, so I can better compare it to find. If it turns out to be a bottleneck, I'll rewrite my program. If it's only a bit slower, I'll accept it and look for something else to optimize.

Thanks a lot for the ideas, guys.
__________________
Mac mini — 1.25GHz G4, 1GB RAM — OS 10.5.8

Useful programs: Privoxy, Butler, ffmpegX, VLC, Perian, Tofu, Wcalc
Reply With Quote
Reply

Bookmarks

Tags
backups, backupsdb, programming, time machine

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are Off
Pingbacks are Off
Refbacks are Off

Forum Jump


All times are GMT -5. The time now is 12:57 PM.


Powered by vBulletin® Version 3.8.4
Copyright ©2000 - 2010, Jelsoft Enterprises Ltd.
Search Engine Optimization by vBSEO 3.3.2
Copyright 2000-2010 DigitalCrowd, Inc.