Best Practices for Scientific Computing

The following tweet found its way into my Twitter newsfeed.

I thought it would be appropriate to take the advice found in this article and apply it specifically to computational combinatorics. In particular, the authors identify and outline eight best practices, which I will discuss individually. Along the way, I will describe how I apply (or will apply) these concepts in my own development process. Below, I have simply copied their “Summary of Best Practices”. You can investigate each item in more detail in the original article. I’ll add my own ideas under each main topic.

Caveat: I showed this paper to a software engineer (my wife) and she thought all of these items would be obvious to anyone with even a basic understanding of software engineering. However, as the original paper is written for biologists and this blog is written for mathematicians, I find it appropriate to cover these ideas.

Write programs for people, not computers

I take this item to heart in every line of code that I write. Everything is designed to be correct first, human-checkable second, and optimized fifth: from how I break complicated tasks into chunks, to the in-line comments, to how whitespace is used for formatting.

1. A program should not require its readers to hold more than a handful of facts in memory at once.
2. Make names consistent, distinctive, and meaningful.
3. Make code style and formatting consistent.

Let the computer do the work

The whole reason we are using the computer is to let the computer do all of the tedious, boring things, so don’t try to do those things yourself! Let the computer run all of the individual cases. When the computer is all done with its output, I usually create a Sage worksheet whose entire purpose is to convert the text data dump into pretty PDF plots and LaTeX tables.

1. Make the computer repeat tasks.
2. Save recent commands in a file for re-use.
3. Use a build tool to automate workflows.
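
As a minimal sketch of the "text dump to LaTeX table" step, assuming a made-up dump format with one "n count" pair per line and "#" comment lines (the function name dump_to_latex is hypothetical, and this is plain Python rather than an actual Sage worksheet):

```python
def dump_to_latex(dump_text):
    """Turn lines of the form 'n count' into a two-column LaTeX tabular."""
    rows = []
    for line in dump_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments in the dump
        n, count = line.split()
        rows.append((int(n), int(count)))

    lines = [r"\begin{tabular}{rr}", r"$n$ & count \\", r"\hline"]
    lines += [rf"{n} & {count} \\" for n, count in rows]
    lines.append(r"\end{tabular}")
    return "\n".join(lines)


# Sample input in the assumed format.
sample_dump = "# n count\n3 7\n4 41\n"
table = dump_to_latex(sample_dump)
print(table)
```

Once a script like this exists, regenerating every table after a new run is a single command instead of an afternoon of copying numbers by hand.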

Make incremental changes

When creating a combinatorial search algorithm, the first development task is to get a brute-force algorithm working. Start by organizing the data so you can update items one-by-one and detect whether or not you have a solution. Once you determine that your algorithm is checking all cases, you can integrate symmetry-aware branching processes such as canonical deletion or orbital branching. Once you determine that your algorithm is checking all cases up to symmetry, you can add your complicated pruning mechanisms.

1. Work in small steps with frequent feedback and course correction.
2. Use a version control system.
3. Put everything that has been created manually in version control.
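
To make the first step concrete, here is a sketch of a symmetry-unaware baseline search, using triangle-free graphs as an illustrative problem of my choosing. It decides one vertex pair at a time and rejects an edge as soon as it would close a triangle. The function name count_triangle_free is hypothetical. There is no canonical deletion or orbital branching here; those would come only after this version is trusted.

```python
from itertools import combinations


def count_triangle_free(n):
    """Count labeled triangle-free graphs on n vertices, with no symmetry reduction."""
    pairs = list(combinations(range(n), 2))
    adjacent = [[False] * n for _ in range(n)]

    def closes_triangle(u, v):
        # Adding {u, v} creates a triangle iff u and v share a neighbor.
        return any(adjacent[u][w] and adjacent[v][w] for w in range(n))

    def search(index):
        if index == len(pairs):
            return 1  # every pair is decided and no triangle was created
        u, v = pairs[index]
        total = search(index + 1)  # branch 1: {u, v} is a non-edge
        if not closes_triangle(u, v):
            adjacent[u][v] = adjacent[v][u] = True  # branch 2: add the edge
            total += search(index + 1)
            adjacent[u][v] = adjacent[v][u] = False  # undo the update
        return total

    return search(0)


print(count_triangle_free(3))  # 7: all 8 graphs on 3 vertices except the triangle
print(count_triangle_free(4))  # 41
```

The small counts can be verified by hand, which gives a baseline to compare against after each later change, such as adding a symmetry-aware branching rule.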

Don’t repeat yourself (or others)

Whenever you find you need something more than once, you should pull that code out and into a library. Among my research software I have a “Utilities” project that stores all of these methods that I need more than once. In particular, this project contains all ranking and unranking algorithms. Also, once you figure out how to parallelize your algorithms, you will not want to write that code another time. That is how my TreeSearch project came about. It is so helpful to have that available when I want to make a new parallel algorithm!

1. Every piece of data must have a single authoritative representation in the system.
2. Modularize code rather than copying and pasting.
3. Re-use code instead of rewriting it.
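
As an example of the kind of method that belongs in such a shared library, here is a sketch of ranking and unranking k-subsets in colexicographic order (the combinatorial number system). This is one standard scheme, not necessarily the one in my Utilities project, and the function names are made up for illustration.

```python
from math import comb


def rank_subset(subset):
    """Colex rank of a set of distinct non-negative integers."""
    return sum(comb(c, i + 1) for i, c in enumerate(sorted(subset)))


def unrank_subset(rank, k):
    """Inverse of rank_subset: the k-subset with the given colex rank."""
    elements = []
    for i in range(k, 0, -1):
        # Find the largest c with comb(c, i) <= rank.
        c = i - 1
        while comb(c + 1, i) <= rank:
            c += 1
        elements.append(c)
        rank -= comb(c, i)
    return sorted(elements)


# Round trip over all 3-subsets of {0, ..., 5}.
for r in range(comb(6, 3)):
    assert rank_subset(unrank_subset(r, 3)) == r
print(unrank_subset(0, 3), unrank_subset(19, 3))
```

Writing this once, with a round-trip check, means every later project can index subsets by a single integer without redoing the arithmetic.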

Plan for mistakes

I know that I frequently make small mistakes during development, so I constantly check that every input to every method is valid. When making a change to a combinatorial object (such as adding an edge to a graph), I make sure that the change is not reversing (or repeating) a previous choice. It is also helpful to have a “debug mode” that prints absolutely everything that is happening at every step. Checking this output on small cases can be very helpful.

1. Add assertions to programs to check their operation.
2. Use an off-the-shelf unit testing library.
3. Turn bugs into test cases.
4. Use a symbolic debugger.
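
A minimal sketch of these checks, assuming a hypothetical SearchGraph class that records which vertex pairs have been decided as edges or non-edges (the class and its attribute names are illustrative only):

```python
class SearchGraph:
    """A partially decided graph on vertices 0..n-1 with input checks."""

    def __init__(self, n, debug=False):
        self.n = n
        self.debug = debug
        self.edges = set()      # pairs decided to be edges
        self.non_edges = set()  # pairs decided to be non-edges

    def add_edge(self, u, v):
        assert 0 <= u < self.n and 0 <= v < self.n, f"vertex out of range: {u}, {v}"
        assert u != v, "loops are not allowed"
        pair = (min(u, v), max(u, v))
        assert pair not in self.edges, f"repeating a previous choice: {pair}"
        assert pair not in self.non_edges, f"reversing a previous choice: {pair}"
        self.edges.add(pair)
        if self.debug:
            # Debug mode: report every update as it happens.
            print(f"add_edge{pair}: {len(self.edges)} edges so far")


graph = SearchGraph(4, debug=True)
graph.add_edge(2, 0)
```

Keep in mind that Python skips assert statements when run with the -O flag, so the checks cost nothing in a long production run but are only active in ordinary runs.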

Optimize software only after it works correctly

Don Knuth famously said “Premature optimization is the root of all evil” and that is even more true in computational combinatorics. At the scale of combinatorial explosion that we consider, a few constants here and there may have a huge effect on our algorithms. However, it is much, much more important that we are performing correct computations! Once a complete and correct product is finished, then you can start doing optimizations. Shaving off some time here and there using slick programming is probably not as helpful as creating a new symmetry-aware algorithm, or performing custom augmentations, or finding a new pruning mechanism. Algorithmic improvements will almost always win out over code optimizations.

1. Use a profiler to identify bottlenecks.
2. Write code in the highest-level language possible.
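
For Python or Sage code, one way to find bottlenecks is the standard library's cProfile module. The sketch below profiles a deliberately naive triangle counter; the function count_triangles and the test graph are made up for illustration.

```python
import cProfile
import io
import pstats
from itertools import combinations


def count_triangles(n, edges):
    """Deliberately naive: test every vertex triple against an edge list."""
    return sum(
        1
        for a, b, c in combinations(range(n), 3)
        if (a, b) in edges and (a, c) in edges and (b, c) in edges
    )


# Storing edges in a list makes every "in" test a linear scan.
complete_graph = list(combinations(range(30), 2))

profiler = cProfile.Profile()
profiler.enable()
triangles = count_triangles(30, complete_graph)
profiler.disable()

# Collect the five most expensive entries, sorted by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

The report shows where the time actually goes, which is often not where you would have guessed. In this toy case, switching the edge list to a set is the obvious fix, but the point is to measure before changing anything.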

Document design and purpose, not mechanics

If you have written code for a human, not the computer, then this part is somewhat obvious. However, these are excellent ideas for working with an existing codebase that requires more information before use by other people.

1. Document interfaces and reasons, not implementations.
2. Refactor code in preference to explaining how it works.
3. Embed the documentation for a piece of software in that software.
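
As a small illustration of documenting interface and reasons inside the code itself, here is a sketch of a Python docstring. The helper normalized_edge is hypothetical; the docstring says what the function promises and why it exists, and leaves the one-line implementation to speak for itself.

```python
def normalized_edge(u, v):
    """Return the edge {u, v} as a sorted tuple.

    Purpose: every edge gets a single authoritative representation, so
    sets and dictionaries keyed by edges never store (u, v) and (v, u)
    as two different entries.

    Raises ValueError for a loop (u == v), since the graphs here are
    assumed to be simple.

    >>> normalized_edge(3, 1)
    (1, 3)
    """
    if u == v:
        raise ValueError(f"loops are not allowed: {u}")
    return (min(u, v), max(u, v))


print(normalized_edge(3, 1))
```

Because the example in the docstring is written in doctest format, running python -m doctest on the file checks that the documentation still matches the behavior.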

Collaborate

My personal research process is to collaborate on the mathematics and high-level algorithmic thinking, but then to hunker down and write code by myself (with a coauthor sometimes creating an independent implementation in order to check correctness). However, I will need to change this once I have a graduate student doing their own development.

1. Use pre-merge code reviews.
2. Use pair programming when bringing someone new up to speed and when tackling particularly tricky problems.
3. Use an issue tracking tool.

What else?

What do you do in your development process to help with these types of issues? Do you have any specific tools that you use? How about the very special situation of collaborating with graduate students on software and research papers? How do you keep everything straight?