Data wrangling is a very practical skill that you will definitely need in your data science or bioinformatics work. It is all about getting data into the right format so it can be used by various downstream tools, plotted visually, or analyzed statistically.

Sublime Text is a quick and powerful tool for data wrangling

You can edit multiple lines at once in Sublime Text by holding down the option key while dragging across the lines. In the video I use this to select all of the “chr” prefixes and delete them, and there are many other ways to use this powerful text editor for some pretty extensive data wrangling. You can learn more about Sublime Text here.

Excel has some decent data wrangling features that you can take advantage of

With GUI wizards and lots of guided tools and explanations, Excel can be used for some of the more basic data wrangling tasks. If you are already working with Excel, it is more convenient to fix a few issues without exporting your data first, and more things may be possible than you think. By Googling for how to split columns by a delimiter, I found out Excel has a neat little tool which does exactly that.

Awk is a powerful and flexible command-line language

There is a lot to say about awk since it is a full programming language in itself, even though we usually reserve it for one-liners. You can filter rows, change values, do calculations, and completely reformat a data file with awk. Here I will just show you how to easily filter the rows based on their content in each column.
If you need an introduction or a refresher on the command-line, check out this bash introduction first:
Intro to bash for data analysis

A full explanation is in the video, but here is the snippet of code.

awk '$7>=99.9 && $12=="chrX"' hg19_to_GRCh38.txt > hg19_to_GRCh38_filtered.txt

This checks that column 7 ($7) has a value greater than or equal to 99.9 AND (&&) that column 12 ($12) equals (==) “chrX”. Don’t forget to put awk at the beginning and encase the rest of the statement in single quotes (‘). The file name goes last, and then we used “>” to write the resulting output to a new file. You have to output to a different file, never to the same one, because the input file is being read gradually and prints gradually to the output, so if you read and write on the same file at the same time it will have issues.
Here are a few other command-line programs you can look into for specific problems you encounter:

  • grep for search
  • sort for sorting
  • uniq for consolidating redundant rows
  • sed for find/replace

That was a small taste of data wrangling with different tools

Data wrangling is not an easy thing to learn or to teach because it is different every time. It is something you will learn as you go. Every dataset has a unique set of issues you have to deal with, but if you are just familiar with a few tools that do similar things to what you are trying to accomplish, you can always Google the rest.

Related Posts


What is bioinformatics?

Bioinformatics is a huge field with many definitions. This video discusses a broader definition of bioinformatics based on the intersection of biology, computer science, and math/statistics. I also break research in the field up into Read more...


Is bioinformatics a lucrative career option for biologists?

In this video we talk about supply and demand for bioinformaticians, and why it is hard for professors to recruit bioinformatics post-docs — which is good for PhDs who want to fill that gap.


How to write a bash script that takes user input

Quick guide to writing a bash script on the Mac/Linux command-line Writing a bash script is important for setting up pipelines to process data or run a series of different tools. It also makes your Read more...