Jumping into the Crazy World of Computational Bio

My current project in the lab is an extension of Titus
Brown’s recent work that benchmarks our lab’s program, khmer, against other
sequence analysis tools (see Titus’ post
on k-mer counting).  We have been working
on a data
protocol
for mRNAseq assembly, and I have been charged with running the
benchmarks.

This post will focus on my initial run of the protocol as it
was in development. Coming from a computer science background, I felt this was
important for two reasons: 1) to get a feel for the process and what to expect, and 2)
to use Amazon's built-in CloudWatch to identify additional metrics to home in on.

With that in mind, I ran the protocol exactly as written,
using the minimum recommended EC2 instance for each step of the tutorial and recording run-time information along the way.
I then realized that, in order to fairly compare wall time, all parts must be run on
the same type of instance, an m1.xlarge.
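(For completeness, recording per-step wall time can be as simple as wrapping each command in a timer. The sketch below is hypothetical; the khmer script invocations shown are placeholders rather than the protocol's actual commands.)

```python
# Hypothetical timing wrapper: run each protocol step and log its wall time.
# The commands below are placeholders, not the actual protocol invocations.
import subprocess
import time

steps = [
    ("digital normalization", "normalize-by-median.py --help"),
    ("error trimming", "filter-abund.py --help"),
]

for name, cmd in steps:
    start = time.time()
    subprocess.call(cmd, shell=True)
    elapsed = time.time() - start
    print("%s: %.1f seconds wall time" % (name, elapsed))
```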
The results are interesting:

The total wall time was around 73 hours, with digital
normalization and error trimming accounting for only 7.5% of it.  This observation, paired with the results
given in the k-mer counting blog post and upcoming paper, could indicate that the use of khmer
greatly reduces resource usage. However, this is an area that merits further
benchmarking metrics, as the software's behavior can change from genome to genome.
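To put that percentage in absolute terms, here is a quick back-of-the-envelope split of the numbers reported above:

```python
# Back-of-the-envelope split of the ~73-hour total wall time.
total_hours = 73.0
khmer_fraction = 0.075  # digital normalization + error trimming

khmer_hours = total_hours * khmer_fraction
other_hours = total_hours - khmer_hours

print("khmer preprocessing: ~%.1f hours" % khmer_hours)  # ~5.5 hours
print("everything else:     ~%.1f hours" % other_hours)  # ~67.5 hours
```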

Also worth noting is that CloudWatch indicated possible
bottlenecks in disk usage and I/O.  For
example, I/O activity persisted throughout the runs, leading me to question Amazon's
cloud architecture. Specifically, is the primary disk on the same internal bus as
the CPU, or is it accessed over the network?

Instinctively, I believe that the parts of the virtual
machine are spread throughout the data center, which would also account for the
correspondence between read/write calls and I/O usage that I noticed.
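For reference, the disk metrics that CloudWatch graphs can also be pulled programmatically. The sketch below is only an illustration; it assumes the boto3 library, configured AWS credentials, and a placeholder instance ID.

```python
# Pull per-hour disk metrics for one EC2 instance from CloudWatch.
# Assumes boto3 is installed, credentials are configured, and the
# instance ID below is replaced with a real one.
import datetime
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
end = datetime.datetime.utcnow()
start = end - datetime.timedelta(hours=73)

for metric in ("DiskReadOps", "DiskWriteOps", "DiskReadBytes", "DiskWriteBytes"):
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName=metric,
        Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
        StartTime=start,
        EndTime=end,
        Period=3600,            # one data point per hour
        Statistics=["Sum"],
    )
    total = sum(point["Sum"] for point in stats["Datapoints"])
    print("%s over the run: %.0f" % (metric, total))
```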

As a side note, I had originally run the annotation commands on an m1.large instance. Switching to an m1.xlarge configuration knocked about 6 hours off the total wall time for that group of commands.

At the end of the day, the next step is to gather and
analyze: 1) individual CPU core usage, 2) memory usage, 3) read/write patterns, and 4) I/O
patterns.
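A rough sketch of what collecting those four measurements on the instance itself could look like, assuming the third-party psutil package (CloudWatch being the other obvious source):

```python
# Sample per-core CPU, memory, and disk I/O counters once a minute.
# Assumes the third-party psutil package is installed on the instance.
import time
import psutil

for _ in range(60):  # one hour of samples at 60-second intervals
    per_core = psutil.cpu_percent(interval=1, percpu=True)  # 1) per-core CPU %
    mem = psutil.virtual_memory()                           # 2) memory usage
    disk = psutil.disk_io_counters()                        # 3) + 4) read/write and I/O counters

    print("cores: %s" % per_core)
    print("memory used: %.1f GB" % (mem.used / 1e9))
    print("reads: %d  writes: %d  bytes read: %d  bytes written: %d"
          % (disk.read_count, disk.write_count, disk.read_bytes, disk.write_bytes))
    time.sleep(59)
```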

Thoughts, comments and/or general advice?
