Sometimes the simplest solution is the easiest

I had a long nucleic acid (NA) sequence in the middle of a much larger text file. Unfortunately, the sequence had been corrupted and contained a non-sequence character. The text file was not in a regular sequence file format, so I could not use one of the many utilities that reformat a file into a correct format. Furthermore, because the sequence was surrounded by a lot of non-sequence text, I could not easily use any of the more traditional UNIX text utilities to find the erroneous character. Finally, since the sequence text is all on a single line, grep just returns the entire line. Not too useful. Of course I could simply have read the line, but hey, who isn’t lazy enough to dismiss that possibility? I also quickly dismissed copying and pasting the line into a new file: if I could not do it from the CLI, it would be cheating (well, not cheating, but I’d spent enough minutes thinking about this to regard it as a ‘small’ challenge).

As I’m currently (once again) in the middle of a “must learn Perl” studying stint, I started to write a script to look for the start of the sequence data and then use a regex to find the naughty character.
Then I remembered that vi has great regex matching functionality. So I opened up the file, moved the cursor to the start of the sequence data and searched with /[^ACGT], which found the wrong character immediately.
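
For anyone who would rather script it, here is a minimal sketch of the same idea in Python (not the Perl script I had started on). It uses the same [^ACGT] character class and reports the offset of every character outside that alphabet. The function name find_bad_chars and the example sequence are made up for illustration, and it assumes you have already pulled the sequence line out of the file.

```python
import re


def find_bad_chars(line, alphabet="ACGT"):
    """Return (offset, character) pairs for every character not in alphabet."""
    # Build a negated character class, e.g. [^ACGT], from the allowed letters.
    pattern = re.compile(f"[^{re.escape(alphabet)}]")
    return [(m.start(), m.group()) for m in pattern.finditer(line)]


# Hypothetical example: one corrupt character in an otherwise valid sequence.
sequence = "ACGTTGCAXACGT"
print(find_bad_chars(sequence))  # [(8, 'X')]
```

Offsets are zero-based, so a reported offset of 8 means the ninth character on the line.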

EDIT: I should also add that I was sat on a beach on the Messinian Gulf at the time, and the post came via email!
