Small steps every coder should take to eliminate future headaches

By definition, scientific researchers strive to build greater
understanding for their chosen communities. However, I am noticing that in the
field of computational biology, the desire to get to the next step is so
great that simple “future-proofing” steps in our code are ignored.

My adventures in benchmarking The Eel Pond Protocol drove
this point home over the past few weeks. As such, I am going to share my experiences on
this project to illustrate issues that seem to pop up quite a bit in open-source
projects.

I have developed a makefile that allows me to run the
entire protocol with one command-line call. Based on the initial testing I
mentioned in my last post (found here), I decided to add some commands to break
down a few of the stats I had collected. Then I began a new run and walked
away with the expectation that it would run as seamlessly as before. Every few hours I peek in on the machine
just to make sure it is behaving; the control freak in me, I suppose.
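For readers who have not driven a pipeline this way before, the sketch below shows the general shape of such a makefile. The stage names, file names, and scripts are hypothetical placeholders rather than the actual Eel Pond steps; the point is simply that every stage is a target with explicit inputs and outputs, so a single `make all` runs the whole chain and a failed stage stops the run at a known point.

```make
# A minimal sketch of a one-command pipeline makefile (GNU make).
# Stage names, file names, and the scripts being called are placeholders,
# not the real Eel Pond commands. Recipe lines must begin with a tab.

all: assembly.fa

trimmed.fq: raw.fq
	quality_trim.sh raw.fq > trimmed.fq

normalized.fq: trimmed.fq
	normalize_reads.sh trimmed.fq > normalized.fq

assembly.fa: normalized.fq
	run_assembler.sh normalized.fq assembly.fa
```

Because each stage leaves its output file behind, re-running `make` after a failure picks up from the last completed stage instead of starting over.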

To my surprise, the first time I checked in, the script had
thrown an error and stopped. Luckily the stopping point was a khmer script
call whose path was outdated due to package updates. At that point, I realized
I hadn’t subscribed to the mailing list and had thus missed the announcement. In this
case I plead ignorance; Titus and the team had announced the changes in a
timely manner, which is all we can ask. I will say, though, that not only had the script
been moved out of the sandbox (yay, official release of a feature), the author
had also renamed the file, so even a search of all of khmer’s files couldn’t find
it. Thus my first tip for open-source coding: if you are moving a script, it
is good practice not to change the file name. That way a user with a fair
amount of command-line experience can find its new home!
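One cheap way to soften this kind of surprise is to have the pipeline verify, before any long-running stage starts, that every external script it will eventually call can actually be found. The fragment below is a sketch of that idea in the same makefile style; the two khmer script names are examples only, standing in for whatever your pipeline happens to call.

```make
# Hypothetical fail-fast guard: check that every external script the pipeline
# will call is on the PATH before kicking off hours of work.
# The script names listed here are examples only.
REQUIRED_SCRIPTS = normalize-by-median.py filter-abund.py

check-scripts:
	@for s in $(REQUIRED_SCRIPTS); do \
		command -v $$s > /dev/null 2>&1 || { echo "Cannot find required script: $$s" >&2; exit 1; }; \
	done

# Reusing the assembly.fa target from the sketch above.
all: check-scripts assembly.fa
```

With a guard like this, a renamed or relocated script shows up as an immediate, clearly worded failure instead of a crash many hours into a run.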

Having fixed my broken script, I was off to the races again! This
time I made it all the way to assembly (around 30 hours in) and realized that
Trinity had also pushed out a new version. Each release of the software
installs to a folder whose name contains the date it was made public. As a
consequence, the part of our protocol that downloads and installs the latest release of
Trinity broke, and the call path needed revision. While this is a rather trivial fix,
the project has a 6-month update cycle, so the path would need to be updated fairly
frequently. As a result, a few of us bounced around ideas on the best way to make this
a relative path. The public protocol now reflects this, effectively eliminating
said maintenance for this part of the protocol.
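As an illustration of the kind of fix involved, the fragment below resolves whichever Trinity directory is newest at run time instead of hard-coding a release-dated folder name. The `trinityrnaseq_*` glob pattern and the install location are assumptions made for the sake of the example, not necessarily how the protocol actually lays things out.

```make
# Hypothetical sketch (GNU make): locate the most recent Trinity install
# at run time rather than hard-coding a release-dated directory name.
# The glob pattern and install location are assumptions for this example.
TRINITY_DIR = $(shell ls -d $(HOME)/trinityrnaseq_* 2>/dev/null | sort -V | tail -n 1)

check-trinity:
	@test -n "$(TRINITY_DIR)" || { echo "No Trinity install found" >&2; exit 1; }
	@echo "Using Trinity from: $(TRINITY_DIR)"
```

Because the variable is computed every time make runs, a new Trinity release changes nothing but the directory the path points at.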

As many of you are probably thinking, both of these are extremely minor issues in the big scheme of things. In the spirit of transparency, running the protocol to find the errors took exponentially more time than the process of fixing them. By the same token, once a project runs correctly, taking an extra hour or two to review and modify the code from a “future-proofing” perspective can save both time and resources down the line. Then again, I have a tendency to overcomplicate logic on the first run-through, so I automatically go back in an effort to reduce runtime complexity. The simple “future-proofing” steps feel like a natural extension of my workflow.

At the end of the day, there is a lingering thought that had I just wandered upon the tutorial in an effort to learn more computational biology, I would have moved on to another source.

2 Comments:

  1. Yep. This experience (and others) are leading us to version everything to avoid this in the future. What I think you’re failing to account for is that the up-front labor involved in future proofing is actually much larger than you think: we don’t just have to cover the things you’re noticing, but everything that everyone could notice!

    My solution to this, generally, has been to use automated approaches to detecting such things. In the case of these pipelines, that’s hard because they’re both long-running and not automated!

    As I said in the "conversation" that Michael and I had about things like this (figshare.com/articles/GEDsubmissiontoFirstWorkshoponSustainableSoftwareforSciencePracticeandExperiences/791567) I think doing a better job with the critical but minute sandbox scripts would have been first on my list to change, had I a time machine. Oh well, hindsight.

    Both Trinity and we need to work out better release approaches, too.

    • The awesome part of an open-source project is that we have plenty of hands in the fire to help identify the little things and address them quickly. Also, having a lab that is constantly working with the process helps, too!

      I completely agree with all your points; to totally "future-proof" would take much more time than I alluded to…. Steve Jobs’s refusal to release a machine in the early ’80s because it wasn’t perfect comes to mind.

      At the end of the day it is all a learning experience, and I think GED does an amazing job of finding the "little" takeaways with each new step!
