Just moozing

Before you can check your notes, you must make them…

Summer cleanup


I’m running out of disk space, which leaves two choices: buy a new disk or remove the old unused stuff. I know that I have email archives and other documents that I haven’t been able to find since the last reinstallation, so I need to be sure that I don’t delete the wrong backups.

The plan

One of my philosophies is that a thing you can’t find doesn’t exist and ought to be thrown out. A solution I have used before is to make yearly ISOs. An ISO has the advantages of being a single file and being read-only, which makes it very easy to keep an overview and to be sure that I don’t erase or change old stuff by accident. Also, an ISO (as in ISO-9660) may be mounted on both Linux and Windows, so it is a safe solution. I am currently considering whether squashfs is useful for this as well – it is one of many file systems included in kernel 2.6.34.

The current scenario is that I have a couple of external USB hard disks, some 3½″ hard disk drives in the drawer, and some PCs (and a NAS) with a lot of data on them. The data formats range from the Windows backup .tib format, .ISOs and .tgz archives to raw disk images (made with partimage). I know that I have a lot of duplicate data. An example is my mp3 collection, which I have backed up on every machine since I started having mp3s.

The process is fairly simple.

  1. Choose an archive or directory
  2. Move/extract data to temporary location
  3. Check for duplicate data with Master repository
  4. Merge with Master repository
  5. Delete original
  6. Return to 1

And at the end, make the yearly ISOs.

I have been inspired by Python for Unix and Linux System Administrators to write a collection of Python scripts. It is not trivial to check whether data is duplicated, so I decided to have some fun and compute md5 sums of the files. This takes forever when you have a lot of data, so I decided to cache the md5 sums to avoid waiting when reprocessing known files. Merging will be done manually, and creating the ISO/squashfs files will be done later.
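As a minimal sketch of the hashing part (my own illustration, not a script from the book), an md5 sum can be computed with Python’s `hashlib`, reading in chunks so multi-gigabyte archives never have to fit in memory:

```python
import hashlib


def md5sum(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file.

    Reads in 1 MB chunks so large backup archives are
    hashed without loading them fully into memory.
    """
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

The chunked read is the important part; a naive `f.read()` on a 4 GB .tib file would be painful.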

Software used

Since I will try to do it correctly, it will include

I have considered using GIT for version control, but I have SVN up and running on my server, so that will be a project for later.

The Python version is 2.5.5. That is not something I have really considered yet – it is just the default one on my Debian system.


I try to teach my students to be structured in their software development, so I suppose I ought to be as well. I will start with the third entry in the list above: check for duplicate data with the Master repository.

In the spirit of unit testing, I will first devise a set of test data and then issue

$ CheckDir --Primary=./Master/ --Secondary=./tmp/

The result will be a list of file names that are present in both the primary and the secondary directory. This would be version 0.1.
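A rough sketch of what such a CheckDir could look like – the function names here are my placeholders for illustration, not the finished tool. Files count as duplicates when their md5 sums match, regardless of the file name:

```python
import hashlib
import os


def md5sum(path):
    """Chunked md5 of a single file."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def index_dir(root):
    """Map md5 digest -> list of paths (relative to root)."""
    index = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            index.setdefault(md5sum(full), []).append(
                os.path.relpath(full, root))
    return index


def check_dir(primary, secondary):
    """Return files in secondary whose content already
    exists somewhere in primary."""
    known = index_dir(primary)
    dupes = []
    for digest, paths in index_dir(secondary).items():
        if digest in known:
            dupes.extend(paths)
    return sorted(dupes)
```

Comparing by content rather than by name matters here, since the same mp3 may have been renamed between backups.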

Later, it will be expanded with caching of md5 sums. This is relevant for the .ISO files, whose content never changes, so it does not have to be recalculated every time.
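One possible shape for that cache – again a hypothetical sketch in modern Python, not the final script. Sums are keyed on path, size and mtime, so an unchanged .ISO is only ever hashed once, and the cache is persisted as JSON between runs:

```python
import hashlib
import json
import os


class Md5Cache:
    """Cache md5 sums keyed by (path, size, mtime) so unchanged
    files (like the read-only .ISOs) are never hashed twice."""

    def __init__(self, cache_file):
        self.cache_file = cache_file
        try:
            with open(cache_file) as f:
                self.entries = json.load(f)
        except (OSError, ValueError):
            # No cache yet, or unreadable: start empty.
            self.entries = {}

    def md5sum(self, path):
        st = os.stat(path)
        key = f"{path}|{st.st_size}|{int(st.st_mtime)}"
        if key not in self.entries:
            h = hashlib.md5()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            self.entries[key] = h.hexdigest()
        return self.entries[key]

    def save(self):
        with open(self.cache_file, "w") as f:
            json.dump(self.entries, f)
```

If a file is touched or rewritten, the size/mtime part of the key changes and it simply gets rehashed.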

I will post code and progress when I get there.

Written by moozing

July 5, 2010 at 09:00

Posted in Tech


