Friday, May 9, 2014

Converting a directory listing into a shell script of copy commands

I needed to copy all the files of a certain name throughout a directory tree into a particular directory, renaming them to contain the name of the path where they had been found so that they could be restored from the copy if needed later.

Trying to be quick about it, I decided to just pipe the result of "find" into a file, and then use a stream editor to modify the file into a shell script that would do the copy.

find . -name "filename.txt" > ~/filelist.txt

After fiddling with sed as an option for changing each line of the file into a copy command, I eventually found awk to be easier to use for this purpose (gawk on all our Linux systems).

gawk '{{ORS=""};print "cp -v " $0;gsub(/\//,"+",$0);{ORS="\n"};print " ./files_backup/" $0}' ~/filelist.txt > ~/files_backup.sh

This reads filelist.txt, which lists every instance of filename.txt that find located, each with its path as find printed it (relative to the starting directory, so beginning with "./"). For each line it builds a cp -v command: the first argument is the original path, and the second is the destination directory files_backup followed by a filename made from the original path with every "/" replaced by "+". The output goes to files_backup.sh, the shell script that actually does the copying for me.

I didn't want to waste more time getting fancy by trying to execute the commands directly, and this way I get a chance to check the result before running it. I chose the "+" character for the destination filenames because some of the directories in the source paths already had underscores and dashes in their names, and "+" appears to be an allowable filename character.

I was bedeviled by awk's default behavior of appending \n to the output of each print statement. I knew that I'd overcome this in the past, but today my Google search yielded a workaround: set the output record separator (ORS) to an empty string for the first print statement, then set it back to \n for the second print statement so that the completed line ends with a newline.
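
A possible alternative to toggling ORS, as a minimal sketch: awk's printf writes only what its format string specifies, so no newline appears unless the format contains \n. The function name make_copy_script and the sample path are made up for illustration; the output format is meant to match the gawk one-liner above.

```shell
# Hypothetical helper: reads paths on stdin, writes one cp command per path.
make_copy_script() {
    awk '{
        printf "cp -v %s ", $0             # printf adds no newline by itself
        gsub(/\//, "+")                    # now replace every "/" in $0 with "+"
        printf "./files_backup/%s\n", $0   # finish the line with an explicit \n
    }'
}

# Sample run with an invented path.
printf '%s\n' './a/filename.txt' | make_copy_script
```

As with the original command, paths containing spaces would still need quoting in the generated script.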

Some links:

A typical sed replacement example forum thread:
http://www.unix.com/shell-programming-and-scripting/12657-using-sed-replace-part-string.html

A fascinating diversion into bash for loops. At one point I tried using a for loop to generate copy commands "live" from the file list, but it didn't work; the list that for iterates over seemed to need to be an actual directory listing rather than lines from a file.
http://www.cyberciti.biz/faq/bash-for-loop/
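
For what it's worth, a for loop iterates over whitespace-separated words rather than lines, so the usual way to walk a file line by line is a while read loop. Here is a minimal sketch of doing the copies "live" that way. The function name copy_from_list and its two arguments (the list file and the destination directory) are assumptions for illustration, not something from the pages linked here.

```shell
# Hypothetical helper: $1 is a file of paths (one per line), $2 is the destination directory.
copy_from_list() {
    list=$1
    dest=$2
    mkdir -p "$dest"
    # IFS= and -r keep leading spaces and backslashes in each line intact.
    while IFS= read -r src; do
        # Replace every "/" with "+" to build the flat destination name.
        name=$(printf '%s' "$src" | tr '/' '+')
        cp -v "$src" "$dest/$name"
    done < "$list"
}
```

It would be called as something like: copy_from_list ~/filelist.txt ./files_backup. Unlike the generated script, this gives no chance to review the commands before they run.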

Here's a bunch of different approaches to a similar (although simpler) task. This thread has examples of using sed, a for loop, and the "rename" command, which I hadn't heard about before, as possible solutions.
http://unix.stackexchange.com/questions/7161/copy-rename-multiple-files-using-regular-expression-shell-script

A link with basic awk syntax:
http://www.grymoire.com/Unix/Awk.html

Another page of basic awk syntax that shows the syntax for the gsub command and an example:
http://www.staff.science.uu.nl/~oostr102/docs/nawk/nawk_92.html

Another page of basic awk syntax, for print statements:
http://www.thegeekstuff.com/2010/01/awk-introduction-tutorial-7-awk-print-examples/

Here's a basic question that I keep forgetting the answer to with awk: how to control whether concatenated outputs are padded with white space or not. The answer is really quite simple!
http://stackoverflow.com/questions/9985330/how-can-i-get-awk-to-print-without-white-space
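
In short: a comma between print arguments inserts the output field separator (OFS, a single space by default), while writing the values next to each other concatenates them with nothing in between. A small sketch, with variable names chosen just for this example:

```shell
# Comma: awk puts OFS (a space by default) between the two fields.
with_comma=$(echo "a b" | awk '{ print $1, $2 }')

# No comma: the two fields are concatenated directly.
no_comma=$(echo "a b" | awk '{ print $1 $2 }')

echo "$with_comma"   # a b
echo "$no_comma"     # ab
```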

The first time I tried this, I attempted to copy the input argument to a new variable and modify the new variable, but for some reason the modifications that I made to the new variable also affected the input variable. I didn't have time to figure out why this was happening, so I reordered my awk statement to print the input argument before modifying it, which is what led to having to fiddle with ORS. Perhaps something in this page of examples, which I was trying to follow but somehow failed, explains why I was having this problem:
http://www.thegeekstuff.com/2010/01/awk-tutorial-understand-awk-variables-with-3-practical-examples/
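
One possible explanation, offered only as a guess since the failing command wasn't kept: gsub operates on $0 when it is called without a third argument, so the copy has to be named explicitly as the target. A minimal sketch under that assumption follows; rename_preview and the variable dst are names invented for this example.

```shell
# Hypothetical helper: reads paths on stdin, prints one cp command per path.
rename_preview() {
    awk '{
        dst = $0                # assignment copies the string into dst
        gsub(/\//, "+", dst)    # third argument: change dst only, leave $0 alone
        print "cp -v " $0 " ./files_backup/" dst
    }'
}

# Sample run with an invented path.
printf '%s\n' './a/filename.txt' | rename_preview
```

With the copy modified separately, a single print can emit the whole line and ORS never needs to change.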

Here's where I got the solution of changing ORS directly:
http://stackoverflow.com/questions/2021982/awk-without-printing-newline

Here's information on which filename characters are allowed in *nix (and other OSs too):
http://en.wikipedia.org/wiki/Filename