Python, R, and bash are the most useful languages to learn right now in bioinformatics. Deciding which one to start with depends on your goals…
Welcome to the very first episode of the OMGenomics show. Our first question is one I have been asked multiple times at conferences:
“I want to learn bioinformatics. Which programming language should I learn first?”
When we talk about learning bioinformatics, it is useful to divide the students up into two groups: the ones who don’t want to make their own software and the ones who do. Both of these groups will do data analysis, run statistical tests, make plots, and use bioinformatics software made by other scientists. But the second group will also make their own bioinformatics software for the community to use. If you need to make some specialized scripts for your own research but you are not releasing anything for other researchers in your field to use, then you are in the first group.
Bioinformaticians who don’t build tools
For the first group, you are likely going to get the most use out of R. Some people are a little stuck up about R, saying it is not a “real” programming language, but it definitely is, and it has a lot of cool things built into it that also make it ideal for bioinformatics.
It has a built-in data type called a data frame that has the same column and row setup as an Excel spreadsheet, where your genes, cells, people, time points, etc. will be rows while your variables are columns. This makes a lot of sense as a way to think about most kinds of data, so the Python people have made a package called pandas to bring some of this functionality to Python, though in my experience it doesn’t work as smoothly as data frames do natively in R. The packages available for R to do bioinformatics are great, ranging from RNA-seq to phylogenetic trees, and these are super easy to install from CRAN or Bioconductor.
If you use the free RStudio software as your programming environment then it is even easier to manage what you are doing, and I would highly recommend RStudio. Another major advantage of R is ggplot2, an awesome package for making plots that gives you results really quickly with even minimal coding skills. I made a video course about ggplot2 on my personal YouTube channel, just search for Plotting in R for Biologists, which includes a good getting started guide for R in general.
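To make the rows-and-columns idea concrete, here is a minimal sketch of a data frame, written with pandas so that all the examples on this page stay in one language. The gene names are real, but the expression values and column names are made up for illustration, and it assumes pandas is installed.

```python
import pandas as pd

# Hypothetical expression table: one row per gene, one column per variable.
expression = pd.DataFrame(
    {
        "gene": ["BRCA1", "TP53", "MYC", "GAPDH"],
        "chromosome": ["chr17", "chr17", "chr8", "chr12"],
        "control": [10.0, 25.0, 40.0, 500.0],
        "treated": [30.0, 20.0, 120.0, 510.0],
    }
)

# A new column computed from two existing columns, for every row at once.
expression["fold_change"] = expression["treated"] / expression["control"]

# Keep only the rows (genes) whose expression at least doubled.
upregulated = expression[expression["fold_change"] >= 2.0]

# Summarize one variable per group: mean fold change per chromosome.
mean_by_chromosome = expression.groupby("chromosome")["fold_change"].mean()

print(upregulated[["gene", "fold_change"]])
print(mean_by_chromosome)
```

The same three steps (add a column, filter rows, summarize by group) are the bread and butter of data frame work in R as well.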
Bioinformaticians who build tools
For bioinformaticians who make their own software, I would recommend either R or Python, plus bash.
R is great for all the reasons I just described, but if you like coding more than statistics, you may enjoy Python’s style a lot more. That sounds like a contradiction: How could you possibly know you enjoy coding more than statistics when you are choosing your first programming language? I would suggest trying them both and seeing what you like best. I personally enjoy coding in Python more than in R because its rules make more sense and it feels more like a programming language. In my experience, it is also much easier to make a command-line tool in Python than in R, and Python also has some packages for bioinformatics that are quite useful.
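As a rough illustration of how little code a command-line tool takes in Python, here is a minimal sketch using the standard library's argparse module. The tool itself (a GC content calculator), its function names, and its --percent flag are all hypothetical and chosen only for this example.

```python
import argparse


def gc_content(sequence: str) -> float:
    """Return the fraction of G and C bases in a DNA sequence."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def build_parser() -> argparse.ArgumentParser:
    """Describe the command-line interface: one sequence and one optional flag."""
    parser = argparse.ArgumentParser(
        description="Report the GC content of a DNA sequence."
    )
    parser.add_argument("sequence", help="DNA sequence, e.g. ACGTGGCC")
    parser.add_argument(
        "--percent",
        action="store_true",
        help="print a percentage instead of a fraction",
    )
    return parser


def main(argv=None) -> str:
    """Parse arguments (the real command line when argv is None) and print the result."""
    args = build_parser().parse_args(argv)
    value = gc_content(args.sequence)
    result = f"{value * 100:.1f}%" if args.percent else f"{value:.3f}"
    print(result)
    return result


if __name__ == "__main__":
    # Demo call with explicit arguments; call main() with no arguments
    # to read the real command line instead.
    main(["ACGTGGCC", "--percent"])
```

With the last line changed to a plain main(), saving this as a script gives you argument checking and a --help message for free, which is a big part of why I find Python comfortable for this kind of tool.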
As you can probably tell, I have used both R and Python a lot in my work, where I use R for plotting and statistics, while I use Python for basically everything else, ranging from merging variant call sets to providing back-end algorithms for my web applications.
It is also very important for bioinformaticians to learn Bash, which for our purposes is interchangeable with the shell, the command line, or the terminal. Bash is the primary way to access your data on your institution’s cluster and to run most genomics and bioinformatics software. It is also very powerful for manipulating your data: sorting, filtering, and doing calculations between columns are all available through various command-line utilities.
In my experience, and that of everyone I have talked to about it, bash is confusing and scary at first, but when you get the hang of it you start to feel this power surging through you, and you can do things in seconds that would take you hours to do by hand. Even two years into it I would still learn something new in bash that would blow my mind, and I would kick myself for having wasted time programming the same thing from scratch in Python.
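To show what "programming it from scratch in Python" can look like, here is a small sketch that filters a table, does a calculation between two columns, and sorts the result. The table contents and column meanings are invented for the example, and the shell one-liner in the comment is only a rough equivalent that assumes a hypothetical file called coverage.tsv.

```python
# Hypothetical tab-separated table: region, length in bases, read count.
table = """chr1\t1000\t50
chr2\t2000\t10
chr3\t500\t40
"""

rows = [line.split("\t") for line in table.splitlines()]

# Filter: keep rows with at least 20 reads (third column).
kept = [row for row in rows if int(row[2]) >= 20]

# Calculation between columns: reads per base (third column / second column).
with_density = [row + [int(row[2]) / int(row[1])] for row in kept]

# Sort: highest density first.
with_density.sort(key=lambda row: row[3], reverse=True)

for region, length, reads, density in with_density:
    print(f"{region}\t{length}\t{reads}\t{density:.3f}")

# Roughly the same job with standard shell utilities:
#   awk -F'\t' '$3 >= 20 {print $0 "\t" $3/$2}' coverage.tsv | sort -t$'\t' -k4,4nr
```

Neither version is wrong, but once you know the utilities, the one-line version is a lot faster to write.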
R, Python, and bash
In summary, for wet-lab people who want to add bioinformatics to their toolbox, focus on learning R first and applying it to your own work. For people who want to focus on bioinformatics as a career and make their own tools too, I would actually recommend learning the trifecta of R, Python, and Bash, though you could get away with choosing between R and Python as long as you still learn Bash too. I can go into more depth on any of these topics or give an introduction to any of these languages if you let me know in the comments.
Other programming languages
There are many other languages out there, so before I end here I’m going to give brief reasons why I don’t recommend a few of them for bioinformatics, for beginners, or in some cases for anyone at all.
C and C++
C or C++ are great for making super optimized command-line tools like aligners and variant-callers, but you will have a much easier time learning Python first and then going to these high-performance languages for a particular problem in the future, since they are harder to learn, more finicky, and take a lot more code to do the same thing.
Perl
Perl is still what a lot of people use, but it is fading out of use because Python accomplishes the same tasks and is easier to write code for, especially for beginners.
Ruby
Ruby is one of those hot languages right now, and for good reason, largely because of the power of Ruby on Rails for making database-driven web applications like blogs or Twitter. However, Ruby is not great for bioinformatics because it lacks the community support, in terms of packages, that R and Python have, so you would be better off learning Python instead of Ruby.
Java
Java is a popular language that most people have heard of. In bioinformatics, a notable example of Java software is the genome browser IGV. However, I would not recommend that beginners learn Java, due to many issues including memory management and the fact that Python and R have many more bioinformaticians who build packages and answer questions online.
That’s all I have to say about bioinformatics programming languages for now. If you want to see more videos like this about bioinformatics, then make sure to subscribe on YouTube and sign up for updates below to get new videos, guides, and scripts about bioinformatics delivered to your email inbox every week.
And if you have a question you would like me to answer on the show, you can send it to me by going to omgenomics.com/tv and typing in your question there.