Computing Skills for Biologists: A Toolbox
A concise introduction to key computing skills for biologists
While biological data continues to grow exponentially in size and quality, many of today’s biologists are not trained adequately in the computing skills necessary for leveraging this information deluge. In Computing Skills for Biologists, Stefano Allesina and Madlen Wilmes present a valuable toolbox for the effective analysis of biological data.
Based on the authors’ experiences teaching scientific computing at the University of Chicago, this textbook emphasizes the automation of repetitive tasks and the construction of pipelines for data organization, analysis, visualization, and publication. Stressing practice rather than theory, the book’s examples and exercises are drawn from actual biological data and solve cogent problems spanning the entire breadth of biological disciplines, including ecology, genetics, microbiology, and molecular biology. Beginners will benefit from the many examples explained step-by-step, while more seasoned researchers will learn how to combine tools to make biological data analysis robust and reproducible. The book uses free software and code that can be run on any platform.
Computing Skills for Biologists is ideal for scientists wanting to improve their technical skills and instructors looking to teach the main computing tools essential for biology research in the twenty-first century.

• Excellent resource for acquiring comprehensive computing skills
• Both novice and experienced scientists will increase efficiency by building automated and reproducible pipelines for biological data analysis
• Code examples based on published data spanning the breadth of biological disciplines
• Detailed solutions provided for exercises in each chapter
• Extensive companion website
Princeton University Press


Computing Skills for Biologists

Computing Skills for Biologists
• • • • • • • • • • • • • • • • • • • • • • • • •


Stefano Allesina & Madlen Wilmes



Copyright © 2019 by Princeton University Press

Published by Princeton University Press
41 William Street, Princeton, New Jersey 08540

6 Oxford Street, Woodstock, Oxfordshire OX20 1TR

All Rights Reserved

Library of Congress Control Number: 2018935166
ISBN 9780691167299

ISBN (pbk.) 9780691182759

British Library Cataloging-in-Publication Data is available

Editorial: Alison Kalett, Lauren Bucca, and Kristin Zodrow
Production Editorial: Mark Bellis

Text and Cover Design: Lorraine Doneker
Cover Credit: Desmond Paul Henry, 1963, courtesy of the D. P. Henry Archive

Production: Erin Suydam
Publicity: Alyssa Sanford

Copyeditor: Alison Durham

This book has been composed in Minion Pro

Printed on acid-free paper.∞

Printed in the United States of America

1 3 5 7 9 10 8 6 4 2

To all biologists
who think they can’t code.

• • • • • • • • • • •

Science is what we understand well enough to explain to a computer.
Art is everything else we do. —Donald E. Knuth

Summary of Contents
• • • • • • • • • • • • • • • • • • • •

List of Figures xix
Acknowledgments xxi

0 Introduction 1

1 Unix 12

2 Version Control 55

3 Basic Programming 81

4 Writing Good Code 120

5 Regular Expressions 165

6 Scientific Computing 185

7 Scientific Typesetting 220

8 Statistical Computing 249

9 Data Wrangling and Visualization 300

10 Relational Databases 337

11 Wrapping Up 366

Intermezzo Solutions 373
Bibliography 389
Indexes 393

Contents
• • • • • • • •

List of Figures xix

Acknowledgments xxi

0 Introduction: Building a Computing Toolbox 1

0.1 The Philosophy 2
0.2 The Structure of the Book 4

0.2.1 How to Read the Book 6
0.2.2 Exercises and Further Reading 6

0.3 Use in the Classroom 8
0.4 Formatting of the Book 10
0.5 Setup 10

1 Unix 12

1.1 What Is Unix? 12
1.2 Why Use Unix and the Shell? 13
1.3 Getting Started with Unix 14

1.3.1 Installation 14
1.3.2 Directory Structure 15

1.4 Getting Started with the Shell 17
1.4.1 Invoking and Controlling Basic Unix Commands 18
1.4.2 How to Get Help in Unix 19
1.4.3 Navigating the Directory System 20

1.5 Basic Unix Commands 22
1.5.1 Handling Directories and Files 22
1.5.2 Viewing and Processing Text Files 24

1.6 Advanced Unix Commands 27
1.6.1 Redirection and Pipes 27
1.6.2 Selecting Columns Using cut 29
1.6.3 Substituting Characters Using tr 32

1.6.4 Wildcards 35
1.6.5 Selecting Lines Using grep 36
1.6.6 Finding Files with find 39
1.6.7 Permissions 41

1.7 Basic Scripting 43
1.8 Simple for Loops 47
1.9 Tips, Tricks, and Going beyond the Basics 49

1.9.1 Setting a PATH in .bash_profile 49
1.9.2 Line Terminators 50
1.9.3 Miscellaneous Commands 50

1.10 Exercises 51
1.10.1 Next Generation Sequencing Data 51
1.10.2 Hormone Levels in Baboons 51
1.10.3 Plant–Pollinator Networks 52
1.10.4 Data Explorer 53

1.11 References and Reading 53

2 Version Control 55

2.1 What Is Version Control? 55
2.2 Why Use Version Control? 55
2.3 Getting Started with Git 56

2.3.1 Installing Git 57
2.3.2 Configuring Git after Installation 57
2.3.3 How to Get Help in Git 58

2.4 Everyday Git 58
2.4.1 Workflow 58
2.4.2 Showing Changes 64
2.4.3 Ignoring Files and Directories 65
2.4.4 Moving and Removing Files 66
2.4.5 Troubleshooting Git 66

2.5 Remote Repositories 68
2.6 Branching and Merging 70
2.7 Contributing to Public Repositories 78
2.8 References and Reading 79

3 Basic Programming 81

3.1 Why Programming? 81
3.2 Choosing a Programming Language 81
3.3 Getting Started with Python 83

3.3.1 Installing Python and Jupyter 83
3.3.2 How to Get Help in Python 84
3.3.3 Simple Calculations with Basic Data Types 85
3.3.4 Variable Assignment 87
3.3.5 Built-In Functions 89
3.3.6 Strings 90

3.4 Data Structures 93
3.4.1 Lists 93
3.4.2 Dictionaries 96
3.4.3 Tuples 100
3.4.4 Sets 101

3.5 Common, General Functions 103
3.6 The Flow of a Program 105

3.6.1 Conditional Branching 105
3.6.2 Looping 107

3.7 Working with Files 112
3.7.1 Text Files 112
3.7.2 Character-Delimited Files 115

3.8 Exercises 117
3.8.1 Measles Time Series 117
3.8.2 Red Queen in Fruit Flies 118

3.9 References and Reading 118

4 Writing Good Code 120

4.1 Writing Code for Science 120
4.2 Modules and Program Structure 121

4.2.1 Writing Functions 121
4.2.2 Importing Packages and Modules 126
4.2.3 Program Structure 127

4.3 Writing Style 133
4.4 Python from the Command Line 135
4.5 Errors and Exceptions 137

4.5.1 Handling Exceptions 138
4.6 Debugging 139
4.7 Unit Testing 146

4.7.1 Writing the Tests 147
4.7.2 Executing the Tests 149
4.7.3 Handling More Complex Tests 150

4.8 Profiling 153
4.9 Beyond the Basics 155

4.9.1 Arithmetic of Data Structures 155
4.9.2 Mutable and Immutable Types 156
4.9.3 Copying Objects 158
4.9.4 Variable Scope 160

4.10 Exercises 161
4.10.1 Assortative Mating in Animals 161
4.10.2 Human Intestinal Ecosystems 162

4.11 References and Reading 163

5 Regular Expressions 165

5.1 What Are Regular Expressions? 165
5.2 Why Use Regular Expressions? 165
5.3 Regular Expressions in Python 166

5.3.1 The re Module in Python 166
5.4 Building Regular Expressions 167

5.4.1 Literal Characters 168
5.4.2 Metacharacters 168
5.4.3 Sets 169
5.4.4 Quantifiers 170
5.4.5 Anchors 171
5.4.6 Alternations 172
5.4.7 Raw String Notation and Escaping Metacharacters 173
5.5 Functions of the re Module 175
5.6 Groups in Regular Expressions 179
5.7 Verbose Regular Expressions 181
5.8 The Quest for the Perfect Regular Expression 181
5.9 Exercises 182

5.9.1 Bee Checklist 182
5.9.2 A Map of Science 182

5.10 References and Reading 184

6 Scientific Computing 185

6.1 Programming for Science 185
6.1.1 Installing the Packages 185

6.2 Scientific Programming with NumPy and SciPy 185
6.2.1 NumPy Arrays 186
6.2.2 Random Numbers and Distributions 194
6.2.3 Linear Algebra 196
6.2.4 Integration and Differential Equations 197
6.2.5 Optimization 200

6.3 Working with pandas 202
6.4 Biopython 208

6.4.1 Retrieving Sequences from NCBI 208
6.4.2 Input and Output of Sequence Data Using SeqIO 210
6.4.3 Programmatic BLAST Search 212
6.4.4 Querying PubMed for Scientific Literature Information 214
6.5 Other Scientific Python Modules 216
6.6 Exercises 216

6.6.1 Lord of the Fruit Flies 216
6.6.2 Number of Reviewers and Rejection Rate 217
6.6.3 The Evolution of Cooperation 217

6.7 References and Reading 219

7 Scientific Typesetting 220

7.1 What Is LaTeX? 220
7.2 Why Use LaTeX? 220
7.3 Installing LaTeX 223
7.4 The Structure of LaTeX Documents 223

7.4.1 Document Classes 224
7.4.2 LaTeX Packages 224
7.4.3 The Main Body 225
7.4.4 Document Sections 227

7.5 Typesetting Text with LaTeX 228
7.5.1 Spaces, New Lines, and Special Characters 228
7.5.2 Commands and Environments 228
7.5.3 Typesetting Math 229
7.5.4 Comments 231
7.5.5 Justification and Alignment 232
7.5.6 Long Documents 232
7.5.7 Typesetting Tables 233
7.5.8 Typesetting Matrices 236

7.5.9 Figures 237
7.5.10 Labels and Cross-References 240
7.5.11 Itemized and Numbered Lists 241
7.5.12 Font Styles 241
7.5.13 Bibliography 242

7.6 LaTeX Packages for Biologists 244
7.6.1 Sequence Alignments with LaTeX 245
7.6.2 Creating Chemical Structures with LaTeX 246

7.7 Exercises 246
7.7.1 Typesetting Your Curriculum Vitae 246

7.8 References and Reading 247

8 Statistical Computing 249

8.1 Why Statistical Computing? 249
8.2 What Is R? 249
8.3 Installing R and RStudio 250
8.4 Why Use R and RStudio? 250
8.5 Finding Help 251
8.6 Getting Started with R 251
8.7 Assignment and Data Types 253
8.8 Data Structures 255

8.8.1 Vectors 255
8.8.2 Matrices 257
8.8.3 Lists 261
8.8.4 Strings 262
8.8.5 Data Frames 263

8.9 Reading and Writing Data 264
8.10 Statistical Computing Using Scripts 267

8.10.1 Why Write a Script? 267
8.10.2 Writing Good Code 267

8.11 The Flow of the Program 270
8.11.1 Branching 270
8.11.2 Loops 272

8.12 Functions 275
8.13 Importing Libraries 278
8.14 Random Numbers 279
8.15 Vectorize It! 280
8.16 Debugging 283
8.17 Interfacing with the Operating System 284

8.18 Running R from the Command Line 285
8.19 Statistics in R 287
8.20 Basic Plotting 290

8.20.1 Scatter Plots 290
8.20.2 Histograms 291
8.20.3 Bar Plots 292
8.20.4 Box Plots 292
8.20.5 3D Plotting (in 2D) 293

8.21 Finding Packages for Biological Research 293
8.22 Documenting Code 294
8.23 Exercises 295

8.23.1 Self-Incompatibility in Plants 295
8.23.2 Body Mass of Mammals 296
8.23.3 Leaf Area Using Image Processing 296
8.23.4 Titles and Citations 297

8.24 References and Reading 297

9 Data Wrangling and Visualization 300

9.1 Efficient Data Analysis and Visualization 300
9.2 Welcome to the tidyverse 300

9.2.1 Reading Data 301
9.2.2 Tibbles 302

9.3 Selecting and Manipulating Data 304
9.3.1 Subsetting Data 305
9.3.2 Pipelines 307
9.3.3 Renaming Columns 308
9.3.4 Adding Variables 309

9.4 Counting and Computing Statistics 310
9.4.1 Summarize Data 310
9.4.2 Grouping Data 310

9.5 Data Wrangling 313
9.5.1 Gathering 313
9.5.2 Spreading 315
9.5.3 Joining Tibbles 316

9.6 Data Visualization 318
9.6.1 Philosophy of ggplot2 319
9.6.2 The Structure of a Plot 320
9.6.3 Plotting Frequency Distribution of One Continuous Variable 321

9.6.4 Box Plots and Violin Plots 322
9.6.5 Bar Plots 323
9.6.6 Scatter Plots 324
9.6.7 Plotting Experimental Errors 325
9.6.8 Scales 326
9.6.9 Faceting 328
9.6.10 Labels 329
9.6.11 Legends 330
9.6.12 Themes 331
9.6.13 Setting a Feature 332
9.6.14 Saving 332

9.7 Tips & Tricks 333
9.8 Exercises 335

9.8.1 Life History in Songbirds 335
9.8.2 Drosophilidae Wings 335
9.8.3 Extinction Risk Meta-Analysis 335

9.9 References and Reading 336

10 Relational Databases 337

10.1 What Is a Relational Database? 337
10.2 Why Use a Relational Database? 338
10.3 Structure of Relational Databases 340
10.4 Relational Database Management Systems 341

10.4.1 Installing SQLite 341
10.4.2 Running the SQLite RDBMS 341

10.5 Getting Started with SQLite 342
10.5.1 Comments 342
10.5.2 Data Types 342
10.5.3 Creating and Importing Tables 343
10.5.4 Basic Queries 344

10.6 Designing Databases 352
10.7 Working with Databases 355

10.7.1 Joining Tables 355
10.7.2 Views 358
10.7.3 Backing Up and Restoring a Database 359
10.7.4 Inserting, Updating, and Deleting Records 360
10.7.5 Exporting Tables and Views 361

10.8 Scripting 362
10.9 Graphical User Interfaces (GUIs) 362

10.10 Accessing Databases Programmatically 362
10.10.1 In Python 363
10.10.2 In R 363

10.11 Exercises 364
10.11.1 Species Richness of Birds in Wetlands 364
10.11.2 Gut Microbiome of Termites 364

10.12 References and Reading 365

11 Wrapping Up 366

11.1 How to Be a More Efficient Computational Biologist 367
11.2 What Next? 368
11.3 Conclusion 371

Intermezzo Solutions 373
Bibliography 389
Indexes 393
Index of Symbols 393
Index of Unix Commands 395
Index of Git Commands 397
Index of Python Functions, Methods, Properties, and Libraries 399

Index of LaTeX Commands and Libraries 401
Index of R Functions and Libraries 403
Index of SQLite Commands 405
General Index 407

List of Figures
• • • • • • •

1.1 Directory structure. 16
2.1 Basic Git workflow. 59
2.2 Branching in Git. 72
10.1 Structure of a relational database. 340
10.2 Schema of a relational database. 354
10.3 Inner and outer joins. 357

Acknowledgments
• • • • • • • • • • • • • • • • •

This book grew out of the lecture notes for the graduate class Introduction
to Scientific Computing for Biologists, taught by Stefano at the University
of Chicago. We would like to thank all the students who took the class—
especially those enrolled in Winter 2016 and 2017, who have test-driven the
book. The class was also taught in abbreviated form at various locations:
thanks to the TOPICOS students at the University of Puerto Rico Río Piedras,
to those attending the 2014 Spring School of Physics at the Abdus Salam
International Center for Theoretical Physics in Trieste, to the participants in the
Mini-Course BIOS 248 at the Hopkins Marine Station of Stanford University,
and to the students who joined the University of Chicago BSD QBio Boot
Camps held at the Marine Biological Laboratory in Woods Hole, MA.

Several people read incomplete drafts of the book, or particular chapters.
Thanks to Michael Afaro and anonymous referees for the critical and con-
structive feedback. Alison Kalett and her team at Princeton University Press
provided invaluable support throughout all the stages of this long journey.

We are grateful to all the scientists who uploaded their data to the Dryad
Digital Repository, allowing other researchers to use it without restrictions:
it’s thanks to them that all the exercises in this book make use of real biological
data, coming from real papers.

The development of this book was supported by the National Science
Foundation CAREER award #1148867.

Stefano: I started programming late in my college years, thanks to a
class taught by Gianfranco Rossi. This class was a real revelation: I found
out that I loved programming, and that it came very naturally to me. After
learning a lot from my cousin and friend Davide Lugli, I even started work-
ing as a software developer for a small telephone company. In the final
year of college, programming turned out to be very useful for my honors
thesis. I worked with Alessandro Zaccagnini, who introduced me to the
beauty of LaTeX. When I started graduate school, my advisor Antonio Bodini
encouraged me to keep working on my computing skills. Stefano Leonardi
convinced me to switch to Linux, pressed me to learn C, and introduced
me to R. Many people are responsible for my computational toolbox, but I
want to mention Daniel Stouffer, who gave me a crash course in svn, and Ed
Baskerville, who championed the use of Git. I thank my students and post-
docs (in order of appearance Anna Eklöf, Si Tang, Phillip Staniczenko, Liz
Sander, Matt Michalska-Smith, Samraat Pawar, Gyuri Barabás, Jacopo Grilli,
Madlen Wilmes, Carlos Marcelo-Sérvan, Dan Maynard, and Zach Miller)
and the other members of the lab for coping with my computational quirks
and demands, and for learning with me many of the tools covered in this
book. Finally, I want to thank my parents, Gianni and Grazia for buying
me a computer instead of a motorcycle, my brother Giulio, and my family,
Elena, Luca & Marta, for love and support.

Madlen: I earned a five-year degree in biology without a single hour of
computational training. That turned out to be a tremendous problem. Fellow
PhD student Ilkka Kronholm took the time to teach R to the “German without
humor.” Later, Ben Brachi generously shared his scripts and contributed to
my fluency in R. I am also grateful to Marco Mambelli who introduced me to
cluster computing and helped me to get a grip on Unix.

My advisor and coauthor Stefano Allesina had, undoubtedly, the biggest
impact on my programming skills. His course, Introduction to Scientific
Computing, was my first experience of a well-structured and constructive
class centered on computing skills. And so the idea for this book was born, as
I wished every student could have a resource to help overcome the initial steep
learning curve of many computing skills, and use examples that were actually
relevant to a biologist’s daily work. I am tremendously grateful to Stefano for
agreeing to write this book together. In the process I not only became a more
proficient programmer and better organized scientist, but also felt inspired by
his productivity and positive attitude.

My dad-in-law, George Wilmes, provided valuable feedback on every
chapter and took care of my kids so I could work on this book. Last but not
least I want to thank my parents and my husband John for helpful suggestions,
love, and support.

C H A P T E R 0
• • • • • • • • • • • • •

Introduction: Building a Computing Toolbox

No matter how much time you spend in the field or at the bench, most of your
research is done when sitting in front of a computer. Yet, the typical curricu-
lum of a biology PhD does not include much training on how to use these
machines. It is assumed that students will figure things out by themselves,
unless they join a laboratory devoted to computational biology—in which
case they will likely be trained by other members of the group in the labora-
tory’s (often idiosyncratic) selection of software tools. But for the vast majority
of academic biologists, these skills are learned the hard way—through painful
trial and error, or during long sessions sitting with the one student in the
program who is “good with computers.”

This state of affairs is at odds with the enormous growth in the size and
complexity of data sets, as well as the level of sophistication of the statistical
and mathematical analysis that goes into a modern scientific publication in
biology. If, once upon a time, coming up with an original idea and collecting
great data meant having most of the project ready, today the data and ideas
are but the beginning of a long process, culminating in publication.

The goal of this book is to build a basic computational toolbox for biolo-
gists, useful both for those doing laboratory and field work, and for those with
a computational focus. We explore a variety of tools and show how they can
be integrated to construct complex pipelines for automating data collection,
storage, analysis, visualization, and the preparation of manuscripts ready for

These tools are quite disparate and can be thought of as LEGO® bricks,
that can be combined in new and creative ways. Once you have added a new
tool to your toolbox, the potential for new research is greatly expanded. Not
only will you be able to complete your tasks in a more organized, efficient, and
reproducible way, but you will attempt answering new questions that would
have been impossible to tackle otherwise.

0.1 The Philosophy

Fundamentally, this book is a manifesto for a certain approach to computing
in biology. Here are the main points we want to emphasize:


Automation

Doing science involves repeating the same tasks several times. For example,
you might need to repeat an analysis when new data are added, or if the same
analysis needs to be carried out on separate data sets, or again if the reviewers
ask you to change this or that part of the analysis to make sure that the results
are robust.

In all of these cases you would like to automate the processing of the data,
such that the data organization and analysis and the production of figures and
statistical results can be repeated without any effort. Throughout the book, we
keep automation at the center of our approach.


Reproducibility

Science should be reproducible, and much discussion and attention goes into
carefully documenting empirical experiments so that they can be repeated.
In theory, reproducing statistical analysis or simulations should be much eas-
ier, provided that the data and parameters are available. Yet, this is rarely
the case—especially when the processing of the data involves clicking one’s
way through a graphical interface without documenting all the steps. In
order to make it easy to reproduce your results, your computational work
should be

readable: Your analysis should be easy to read and understand. This involves writing
good code and documenting what you are doing. The best way to proceed is to
think of your favorite reader: yourself, six months from now. When you receive
feedback from the reviewers, and you have to modify the analysis, will you be able
to understand precisely what you did, how, and why? Note that there is no way to
email yourself in the past to ask for clarifications.

organized: Keeping the project tidy and well organized is a struggle, but you don’t
want to open your project directory only to find that there are 16 versions of the
same program, all with slight—and undocumented—variations!

self-contained: Ideally, you want all of your data, code, and results in the same place,
without dependencies on other files or code that are not in the same location.
In this way, it is easy to share your work with others, or to work on your projects
from different computers.


Openness

Science is a worldwide endeavor. If you use costly, proprietary software, the
chances are that researchers in less fortunate situations cannot reproduce your
results or use your methods to analyze their data. Throughout the book, we
focus on free software: not only is the software free in the sense that it costs
nothing, but free also means that you have the freedom to run, copy, distribute,
study, change, and improve the software.


Simplicity

Try to keep your analysis as simple as possible. Sometimes, “readable” and
“clever” are at odds, meaning that a single line of code processing data in 14
different ways at once might be genius, but seldom is it going to be readable. In
such cases, we tend to side with readability and simplicity—even if this means
writing three additional lines of code. We also advocate the use of plain text
whenever possible, as text is portable to all computer architectures and will
be readable decades from now.


Correctness

Your analysis should be correct. This means that programming in science is
very different from programming in other areas. For example, bugs (errors
in the code) are something the software industry has learned to manage and
live with—if your application unexpectedly closes or if your word processor
sometimes goes awry, it is surely annoying, but unless you are selling pace-
makers this is not going to be a threat. In science, it is essential that your code
does solely what it is meant to do: otherwise your results might be unjustified.
This strong emphasis on correctness is peculiar to science, and therefore you
will not find all of the material we present in a typical programming textbook.
We explore basic techniques meant to ensure that your code is correct and we
encourage you to rewrite the same analysis in (very) different programming
languages, forcing you to solve the problem in different ways; if all programs
yield exactly the same results, then they are probably correct.

Science as Software Development

There is a striking parallel between the process of developing software and
that of producing science. In fact, we believe that basic tools adopted by soft-
ware developers (such as version control) can naturally be adapted to the
world of research. We want to build software pipelines that turn ideas and
data into published work; the development of such a pipeline has important
milestones, which parallel those of software development: one can think of
a manuscript as a “beta version” of a paper, and even treat the comments of
the reviewers as bugs in the project which we need to fix before releasing our
product! The development of these pipelines is another central piece of our approach.

0.2 The Structure of the Book

The book is composed of 10 semi-independent chapters:

Chapter 1: Unix
We introduce the Unix command line and show how it can be used to
automate repetitive tasks and “massage” your data prior to analysis.

Chapter 2: Version control
Version control is a way to keep your scientific projects tidily organized,
collaborate on science, and have the whole history of each project at your
fingertips. We introduce this topic using Git.

Chapter 3: Basic programming
We start programming, using Python as an example. We cover the basics:
from assignments and data structures to the reading and writing of files.

Chapter 4: Writing good code
When we write code for science, it has to be correct. We show how to
organize your code in an effective way, and introduce debugging, unit
testing, and profiling, again using Python.

Chapter 5: Regular expressions
When working with text, we often need to find snippets of text matching a
certain “pattern.” Regular expressions allow you to describe to a computer
what you are looking for. We show how to use the Python module re to
extract information from text.

Chapter 6: Scientific computing
Modern programming languages offer specific libraries and packages for
performing statistics, simulations, and implementing mathematical models.
We briefly cover these tools using Python. In addition, we introduce
Biopython, which facilitates programming for molecular biology.

Chapter 7: Scientific typesetting
We introduce LaTeX for scientific typesetting of manuscripts, theses, and books.

Chapter 8: Statistical computing
We introduce the statistical software R, which is fully programmable and for
which thousands of packages written by scientists for scientists are available.

Chapter 9: Data wrangling and visualization
We introduce the tidyverse, a set of R packages that allow you to write
pipelines for the organization and analysis of large data sets. We also show
how to produce beautiful figures using ggplot2.

Chapter 10: Relational Databases
We present relational databases and sqlite3 for storing and working
efficiently with large amounts of data.

Clearly, there is no way to teach these computational tools in 10 brief
chapters. In fact, in your library you will find several thick books devoted to
each and every one of the tools we are going to explore. Similarly, becoming
a proficient programmer cannot be accomplished by reading a few pages, but
rather it requires hundreds of hours of practice. So why try to cover so much
material instead of concentrating on a few basic tools?

The idea is to provide a structured guide to help jump-start your learning
process for each of these tools. This means that we emphasize breadth over
depth (a very unusual thing to do in academia!) and that success strongly
depends on your willingness to practice by trying your hand at the exer-
cises and embedding these tools in your daily work. Our goal is to showcase
each tool by first explaining what the tool is and why you should mas-
ter it. This allows you to make an informed decision on whether to invest
your time in learning how to use it. We then guide you through some basic
features and give you a step-by-step explanation of several simple examples.
Once you have worked through these examples, the learning curve will
appear less steep, allowing you to find your own path toward mastering the material.

0.2.1 How to Read the Book

We have written the book such that it can be read in the traditional way: start
from the first page and work your way toward the end. However, we have
striven to provide a modular structure, so that you can decide to skip some
chapters, focus on only a few, or use the book as a quick reference.

In particular, the chapters on Unix (ch. 1), version control (ch. 2), LaTeX
(ch. 7), and databases (ch. 10) can be read quite independently: you will
sometimes find references to other chapters, but in practice there are no
prerequisites. Also, you can decide to skip any of these chapters (though
we love each of these tools!) without affecting the reading of the rest of the book.

We present programming in Python (chs. 3–6) and then again in R
(chs. 8–9). While we go into more detail when explaining basic concepts in
Python, you should be able to understand all of the R material without having
read any of the other chapters. Similarly, if you do not plan to use R, you can
skip these chapters without impacting the rest of the book.

0.2.2 Exercises and Further Reading

In each chapter, upon completion of the material you will be ready to start
working on the “Exercises” section. One of the main features of this book is
that exercises are based on real biological data taken from published papers.
As such, these are not silly little exercises, but rather examples of the chal-
lenges you will overcome when doing research. We have seen that some
students find this level of difficulty frustrating. It is entirely normal, however,
to have no idea how to solve a problem at first. Whenever you feel that frus-
tration is blocking your creativity and efficiency, take a short break. When
you return, try breaking the problem into smaller steps, or start from a blank
slate and attempt an entirely different approach. If you keep chipping away
at the exercise, then little by little you will make sense of what the problem
entails and—finally—you will find a way to crack it. Learning how to enjoy
problem solving and to take pride in a job well done are some of the main
characteristics of a good scientist.

For example, we are fond of this quote from Andrew Wiles (who proved
the famous Fermat’s last theorem, which baffled mathematicians for cen-
turies): “You enter the first room of the mansion and it’s completely dark. You
stumble around bumping into the furniture but gradually you learn where
each piece of furniture is. Finally, after six months or so, you find the light
switch, you turn it on, and suddenly it’s all illuminated.” Hopefully, it will
take you less than six months to crack the exercises!

Note that there isn’t “a” way to solve a problem, but rather a multitude
of (roughly) equivalent ways to do so. Each and every approach is perfect,
provided that the results are correct and that the solution is found in a rea-
sonable amount of time. Thus, we encourage you to consult our solutions
to the exercises only once you have solved them: Did we come up with the
same idea? What are the advantages and disadvantages of these approaches?
Even if you did not solve the task entirely, you have likely learned a lot more
while trying, compared to reading through the solutions upon hitting the first
stumbling block. To provide a further stepping stone between having no idea
where to start and a complete solution, we provide a pseudocode solution
of each exercise online: the individual steps of the solution are described in
English, but no code is provided. This will give you an idea of how to approach
the problem, but you will need to come up with the code. From there, it is only
a short way to tackling your very own research questions. You can find the
complete solutions and the pseudocode at

When solving the exercises, the internet is your friend. Finding help
online is by no means considered “cheating.” On the contrary, if you find
yourself exploring additional resources, you are doing exactly the right thing!
As with research, anything goes, as long as you can solve your problem (and
give credit where credit is due). Consulting the many comprehensive online
forums gives you a sense of how widespread these computational tools are.
Keep in mind that the people finding clever answers to your questions also
started from a blank slate at some point in their career. Moreover, seeing that
somebody else asked exactly your question should further convince you that
you are on the right track.

Last but not least, the “Reading” section of each chapter contains refer-
ences to books, tutorials, and online resources to further the knowledge of
the material. If the chapter is an appetizer, meant to whet your appetite for
knowledge, the actual meal is contained in the reading list. This book is a
road map that equips you with sufficient knowledge to choose the appropriate
tool for each task, and take the guesswork out of "Where should I start my
learning journey?" However, only by reading more on the topic and by
introducing these tools into your daily research work will you be able to truly
master these skills, and make the most of your computer.

We conclude with the sales pitch we use to present the class that inspired
this book. If you are a graduate student and you read the material, you
work your way through all the exercises, constantly striving to further your
knowledge of these topics by introducing them into your daily work, then you
will shave six months off your PhD—and not any six months, but rather those
spent wrestling with the data, repeating tedious tasks, and trying to convince
the computer to be reasonable and spit out your thesis. All things considered,
this book aims to make you a happier, more productive, and more creative
scientist. Happy computing!

0.3 Use in the Classroom

We have been teaching the material covered in this book, in the graduate
class Introduction to Scientific Computing for Biologists, at the University of
Chicago since 2012. The enrollment has been about 30 students per year. We
found the material appropriate for junior graduate students as well as senior
undergraduates with some research experience.

The University of Chicago runs on a quarter system, allowing for 10 lec-
tures of three hours each. Typically, each chapter is covered by a single lecture,
with "Version Control" (ch. 2) and "Scientific Typesetting" (ch. 7) each taking
about an hour and a half, and "Writing Good Code" (ch. 4) and "Statistical
Computing" (ch. 8) taking more than one lecture each.

In all cases, we taught students who had computers available in class,
either by teaching in a computer lab, or by asking students to bring their
personal laptops. Rather than using slides, the instructor lectured while typing all
the code contained in the book during the class. This makes for a very inter-
active class, in which all students type all of the code too—making sure that
they understand what they are doing. Clearly, this also means that the pace
slows down whenever a student mistypes a command or cannot access their
programs. To ease this problem, having teaching assistants for the class helps
immensely. Students can raise their hand, or stick
a red post-it on their computer to signal a problem. The teaching assistant
can immediately help the student and interrupt the class in case the problem
is shared by multiple students—signaling the need for a more general explanation.

To allow the class to run smoothly, each student should prepare their
computer in advance. We typically circulate each chapter a week in advance of
class, encouraging the students to (a) install the software needed for the class
and (b) read the material beforehand. Teaching assistants also offer weekly
office hours to help with the installation of software, or to discuss the material
and the exercises in small groups.

The “intermezzos” that are interspersed in each chapter function very
well as small in-class exercises, allowing the students to solidify their knowl-
edge, as well as highlighting potential problems with their understanding of
the material.

We encourage the students to work in groups on the exercises at the end
of each chapter, and review the solutions at the beginning of the following
class. While this can cause some difficulties in grading, we believe that work-
ing in groups is essential to overcome the challenge of the exercises, making
the students more productive, and allowing less experienced students to learn
from their peers. Publishing a blog where each group posts their solutions
reinforces the esprit de corps, creating a healthy competition between the
groups, and further instilling in the students a sense of pride for a job well
done. We also encouraged students to constructively comment on the differ-
ent approaches of other groups and discuss the challenges they’ve faced while
solving the exercises.

Another characteristic of our class has been the emphasis on the practi-
cal value of the material. For example, we ask each student to produce a final
project in which they take a boring, time-consuming task in their laboratory
(e.g., analysis of batches of data produced by laboratory machines, calibration
of methods, other repetitive computational tasks) and completely automate
it. The student then shows their work to their labmates and scientific advisor,
and writes a short description of the program, along with the documenta-
tion necessary to use it. The goal of the final project is simply to show the
student that mastering this material can save them a lot of time—even when
accounting for the strenuous process of writing their first programs.

We have also experimented with a "flipped classroom" setting, with mixed
results. In this case, the students read the material at their own pace, and work
through all the small exercises contained in the chapter. The lecture is then
devoted to working on the exercises at the end of each chapter. The lecturer
guides the discussion on the strategies that can be employed to solve the prob-
lem, sketching pseudocode on the board, and eventually producing a fully
fledged code on the computer. We have observed that, while this approach is
very rewarding for students with some prior experience in programming, it
is much less engaging for novices, who feel lost and out of touch with the rest
of the class. Probably, this would work much better if the class size were small
(less than 10 students).

Finally, we have found that leading by example serves as powerful motivation
to students. We have always shown that we use the tools covered here for
our own research. A well-placed anecdote on Git saving the day, or showing
how all the tables in a paper were automatically generated with a few lines of R,
can go a long way toward convincing the students that their work studying the
material will pay off over a lifetime.

0.4 Formatting of the Book

You will find all commands and the names of packages typeset in a fixed-width
font. User-provided [INPUT] is capitalized and set between square brackets.
To execute the commands, you do not need to reproduce such formatting.
Within explanatory text, technical terms are presented in italics.

Throughout the book, we provide many code examples, enclosed in
gray boxes and typeset using fixed-width fonts. All code examples are also
provided on the companion website—but
we encourage you to type all the code in by yourself: while this might feel
slow and inefficient, the learning effect is stronger compared to simply copy-
ing and pasting, and only inspecting the result. Within the code examples,
language-specific commands are highlighted in bold.

Within the code boxes, we try to keep lines short. When we cannot avoid
a line that is longer than the width of the page, we use the symbol � to indicate
that what follows should be typed in the same line as the rest.

0.5 Setup

Before you can start computing, you need to set up the environment, and
download the data and the code.

What You Need

A computer: All the software we present here is free and can be installed with a few
commands in Linux Ubuntu or Apple’s OS X; we strive to provide guidance for
Windows users. There are no specific hardware requirements. All the tools require
relatively little memory and space on your hard drive.

Software: Each chapter requires installing specific software. We have collected
detailed instructions guiding you through the installation of each tool at

A text editor: While working through the chapters, youwill write a lot of code.Much
will be written in the integrated development environments (IDEs) Jupyter and
RStudio. Sometimes, however, you will need to write code in a text editor. We
encourage you to keep working with your favorite editor, if you already have one.
If not, please choose an editor that can support syntax highlighting for Python, R,
and LATEX. There are many options to choose from, depending on your architecture
and needs.3

Initial Setup

You can find instructions for the initial setup on our website at computing We have bundled all the data, code, exercises,
and solutions in a single download. We strongly recommend that you save
this directory in your home directory (see section 1.3.2).


C H A P T E R 1
• • • • • • • • • • • • •


1.1 What Is Unix?

Unix is an operating system, which means that it is the software that lets you
interface with the computer. It was developed in the 1970s by a group of pro-
grammers at the AT&T Bell laboratories. The new operating system was an
immediate success in academic circles, with many scientists writing new pro-
grams to extend its features. This mix of commercial and academic interest
led to the many variants of Unix available today (e.g., OpenBSD, Sun Solaris,
Apple’s OS X), collectively denoted as *nix systems. Linux is the open source
Unix clone whose “engine” (kernel) was written from scratch by Linus Tor-
valds with the assistance of a loosely knit team of hackers from across the
internet. Ubuntu is a popular Linux distribution (version of the operating system).

All *nix systems are multiuser, network-oriented, and store data as plain
text files that can be exchanged between interconnected computer systems.
Another characteristic is the use of a strictly hierarchical file system, discussed
in section 1.3.2.

This chapter focuses primarily on the use of the Unix shell. The shell is
the interface that is used to communicate with the core of the operating sys-
tem (kernel). It processes the commands you type, translates them for the
kernel, and shows you the results of your operations. The shell is often run
within an application called the terminal. Together, the shell and terminal are
also referred to as a command-line interface (CLI), an interface that allows
you to input commands as successive lines of text. Though technically not
correct, the terms shell, command line (interface), and terminal are often
used interchangeably. Even if you have never worked with a command-line
interface, you have surely seen one in a movie: Hollywood likes the stereo-
type of a hacker typing code in a small window with a black background (i.e.,
command-line interface).

Unix ● 13

Today, several shells are available, and here we concentrate on the most
popular one, the Bash shell, which is the default shell in Ubuntu and OS X.
When working on the material presented in this book, it is convenient, though
not strictly necessary, to work in a *nix environment. Git Bash for Windows
emulates a Unix shell. As the name Git Bash implies, it also uses the Bash shell.

1.2 Why Use Unix and the Shell?

Many biologists are not familiar with using *nix systems and the shell, but
rather prefer graphical user interfaces (GUIs). In a GUI, you work by inter-
acting with graphical elements, such as buttons and windows, rather than
typing commands, as in a command-line interface. While there are many
advantages to working with GUIs, working in your terminal will allow you to
automate much of your work, scale up your analysis by performing the same
tasks on batches of files, and seamlessly integrate different programs into a
well-structured pipeline.

This chapter is meant to motivate you to get familiar with command-line
interfaces, and to ease the initially steep learning curve. By working through
this chapter, you will add a tool to your toolbox that is the foundation of many
others—you might be surprised to find out how natural it will become to turn
to your terminal in the future. Here are some more reasons why learning to
use the shell is well worth your effort:

First, Unix is an operating system written by programmers for program-
mers. Thismeans that it is an ideal environment for developing your code and
managing your data.

Second, hundreds of small programs are available to perform simple
tasks. These small programs can be strung together efficiently so that a single
line of Unix commands can perform complex operations, which otherwise
would require writing a long and complicated program. The ability to cre-
ate these pipelines for data analysis is especially important for biologists, as
modern research groups produce large and complex data sets whose analy-
sis requires a level of automation and reproducibility that would be hard to
achieve otherwise. For instance, imagine working with millions of files by
having to open each one of them manually to perform an identical task, or try
opening your single 80 GB whole-genome sequencing file in software with a
GUI! In Unix, you can string a number of small programs together, each per-
forming a simple task, and create a complex pipeline that can be stored in a
script (a text file containing all the commands). Such a script makes your work
100% reproducible. Would you be able to repeat the exact series of 100 clicks
of a complex analysis in a GUI? With a script, you will always obtain the same
result! Furthermore, you will also save much time. While it may take a while
to set up your scripts, once they are in place, you can let the computer analyze
all of your data while you’re having a cup of coffee. This level of automation is
what we are striving for throughout the book, and the shell is the centerpiece
of our approach.
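The idea above can be sketched in a tiny pipeline. The toy file below is made up for illustration; it mimics the FASTA format, in which every sequence header starts with ">":

```shell
# create a small FASTA-like file to experiment with
printf ">seq1\nATGC\n>seq2\nGGCA\n" > toy.fasta

# grep selects the header lines, wc -l counts them:
# two small programs strung together count the sequences
grep ">" toy.fasta | wc -l
```

The vertical bar (|) is the pipe: it feeds the output of the first command into the second, which is exactly the kind of chaining of small programs described above.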

Third, text is the rule: If your data are stored in a text file, they can be
read and written by any machine, without the need for sophisticated (and
expensive) proprietary software. Text files are (and always will be) supported
by any operating system and you will still be able to access your data decades
from today (while this is not the case for most proprietary file formats). The
text-based nature of Unix might seem unusual at first, especially if you are
used to graphical interfaces and proprietary software. However, remember
that Unix has been around since the early 1970s and will likely be around at
the end of your career. Thus, the hard work you put into learning Unix will
pay off over a lifetime.

The long history of Unix means that a large body of tutorials and support
websites are readily available online. Last but not least, Unix is very stable,
robust, secure, and—in the case of Linux—freely available.

In the end, it is almost impossible for a professional scientist to entirely
avoid working in a Unix shell: the majority of high-performance computing
platforms (computer clusters, large workstations, etc.) run a Unix or Linux
operating system. Similarly, the transfer of large files, websites, and data
between machines is often accomplished through command-line interfaces.

Mastering the skills presented in this chapter will allow you to work with
large files (or with many files) effortlessly. Most operations can be accom-
plished without the need to open the file(s) in an editor, and can be automated
very easily.

1.3 Getting Started with Unix

1.3.1 Installation

The Linux distribution Ubuntu and Apple’s OS X are members of the *nix
family of operating systems. If you are using either of them, you do not need
to install any specific software to follow the material in this chapter.

Microsoft Windows is not based on a *nix system; you can, however,
recreate a Unix environment within Windows by installing the Ubuntu oper-
ating system in a virtual machine. Alternatively, Windows users can install
Git Bash, a clone of the Bash terminal. It provides basic Unix and Git
commands, and many other standard Unix utilities can be installed.
Please find instructions for its installation in CSB/unix/installation.
Windows also ships with the program Command Prompt, a command-line
interface for Windows. However, many commands differ from their Bash
shell counterparts, so we will not cover these here.

1.3.2 Directory Structure

In Unix we speak of “directories,” while in a graphical environment the term
“folder” is more common. These two terms are interchangeable and refer
to a structure that may contain subdirectories and files. The Unix directory
structure is organized hierarchically in a tree. Figure 1.1 illustrates a direc-
tory structure in the OS X operating system. The topmost directory in the
hierarchy is also called the “root” directory and is denoted by an individ-
ual slash (/). The precise architecture varies among the different operating
systems, but there are some important directories that branch off the root
directory in most operating systems:

/bin Contains several basic programs
/dev Contains the files connecting to devices such as the keyboard, mouse,

and screen
/etc Contains configuration files
/tmp Contains temporary files

Another important directory is your home directory (also called the login
directory), which is the starting directory when you open a new shell. It con-
tains your personal files, directories, and programs. The tilde (∼) symbol is
shorthand for the home directory in Ubuntu and OS X. The exact path to
your home directory varies slightly among different operating systems. To
print its location, open a terminal and type1

echo $HOME

The command echo prints a string to the screen. The dollar sign indicates a
variable. You will learn more about variables in section 1.7.
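As a quick preview, you can also define your own variable and print it with echo (the name MY_DIR below is just an example):

```shell
# assign a value to a variable;
# note that there are no spaces around the equal sign
MY_DIR="results"

# prepend the dollar sign to retrieve the value
echo $MY_DIR
```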

If you followed the instructions in section 0.5, you should have created
a directory called CSB in your home directory. In Ubuntu the location is
/home/YOURNAME/CSB, in OS X it is /Users/YOURNAME/CSB. Windows users need

1. Windows users, use Git Bash or type echo %USERPROFILE% at the Windows Command Prompt.







Figure 1.1. An example of the directory structure in the OS X operating system. It shows
several directories branching off the root directory (/). In OS X the home directory (∼) is a
subdirectory of Users, and in Unix it is a subdirectory of home. If you have followed the
instructions in section 0.5, you will find the directory CSB in your home directory. As an
example, we show the full path of the file

to decide where to store the directory. Within CSB, you will find several direc-
tories, one for each chapter (e.g., CSB/unix). Each of these directories contains
the following subdirectories:

installation The instructions for installing the software needed for
the chapter are contained here. These are also available

sandbox This is the directory where we work and experiment.
data This directory provides all data for the examples and exer-

cises, along with the corresponding citations for papers and



solutions The detailed solutions for the exercises are here, as well as
sketches of the solutions in plain English (pseudocode) that
you should consult if you don’t know how to proceed with an
exercise. Solutions for the “Intermezzo” sections are available
at the end of the book.

When you navigate the system, you are in one directory and can move
deeper in the tree, or upward toward the root. Section 1.4.3 discusses the com-
mands you can use to move between the hierarchical levels and determine
your location within the directory structure.

1.4 Getting Started with the Shell

In Ubuntu, you can open a shell by pressing Ctrl+Alt+T, or by opening the
dash (hold the Meta key) and typing Terminal. In OS X, you want to open
the Terminal application, which is located in the folder Utilities within
Applications. Alternatively, you can type “Terminal” in Spotlight. Windows
users can launch Git Bash or another terminal emulator. In all systems, the
shell automatically starts in your home directory.

When you open a terminal, you should see a line (potentially containing
information on your user name and location), ending with a dollar ($) sign.
When you see the dollar sign, the terminal is ready to accept your commands.
Give it a try and type

# display date and time
$ date

In this book, a $ sign at the beginning of a line of code signals that the
command has to be executed in your terminal. You do not need to type the
$ sign in your terminal, only copy the command that follows it. A line start-
ing with a hash symbol (#) means that everything that follows is a comment.
While Unix ignores comments, you will find hints, reminders, and explana-
tions there. Make plenty of use of comments to document your own code.
When writing multiple lines of comments, start each with #.

In Unix, you can use the Tab key to reduce the amount you have to type,
which in turn reduces the probability of making mistakes. When you press
Tab in a (properly configured) shell, it will try to automatically complete
your command, directory, or file name. If multiple completions are possi-
ble, you can display them all by hitting the Tab key twice. Additionally, you
can navigate the history of commands you have typed by using the up/down
arrows. This is very convenient as you do not need to retype a command that
you have executed recently. The following box lists keyboard shortcuts that
help pace through long lines of code.

Ctrl+A Go to the beginning of the line.

Ctrl+E Go to the end of the line.

Ctrl+L Clear the screen.

Ctrl+U Clear the line before the cursor position.

Ctrl+K Clear the line after the cursor.

Ctrl+C Kill the command that is currently running.

Ctrl+D Exit the current shell.

Alt+F Move cursor forward one word (in OS X, Esc+F).

Alt+B Move cursor backward one word (in OS X, Esc+B).

Mastering these and other keyboard shortcuts will save you a lot
of time. You may want to print this list (available at computingskillsfor
and keep it next to your keyboard—when used consistently you will have
them all memorized and will start using them.
1.4.1 Invoking and Controlling Basic Unix Commands

Some commands can be executed simply by typing their name:

# print a simple calendar
$ cal

However, you can pass an argument to the command to alter its behavior:

# pass argument to cal to print specific year
$ cal 2020


Some commands may require obligatory arguments and will return an
error message if they are missing. For example, the command to copy a
file needs two arguments: what file to copy, and where to copy it (see
section 1.5.1).
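For example, cp stops with an error when its arguments are missing; the exact message varies between systems, and the file names below are made up:

```shell
# create an empty file to play with
touch notes.txt

# source and destination provided: this works
cp notes.txt notes_copy.txt

# no arguments at all: cp prints an error and aborts
cp
```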

In addition, all commands can be modified using options that are specific
to each command. For instance, we can print the calendar in Julian format,
which labels each day with a number starting on January 1:

# use option -j to display Julian calendar
$ cal -j

Options can be written using either a dash followed by a single letter
(older style, e.g., -j) or two dashes followed by words (newer style, e.g.,
--julian). Note that not every command offers both styles.

In Unix, the placement of spaces between a command and its options
or arguments is important. There needs to be a space between the com-
mand and its options, and between multiple arguments. However, if you
are supplying multiple options to a command you can string them together
(e.g., -xvzf).
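For example, on GNU/Linux systems the following three invocations of ls are equivalent; note that the BSD ls shipped with OS X supports only the short forms:

```shell
# short options passed one by one
ls -l -h

# the same short options strung together
ls -lh

# newer-style long option (GNU ls only)
ls -l --human-readable
```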

If you are using Unix commands for the first time, it might seem odd
that you usually do not get a message or response after executing a command
(unless the command itself prints to the screen). Some commands provide
feedback on their activity when you request it using an option. Otherwise,
seeing no (error) message means that your command worked.

Last but not least, it is important to know how to interrupt the execution
of a command. Press Ctrl+C to halt the command that is currently running
in your shell.

1.4.2 How to Get Help in Unix

Unix ships with hundreds of commands. As such, it is impossible to remem-
ber them all, let alone all their possible options. Fortunately, each command
is described in detail in its manual page, which OS X and Ubuntu users can
access directly from the shell by typing man [COMMAND]. Use arrows to scroll
up and down and press q to close the manual page. Users of Git Bash can
search online for unix man page [COMMAND] to find many sites displaying the
manual of Unix commands.
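If you only need a quick reminder rather than the full manual, many commands also print a compact usage summary when invoked with the --help option (GNU tools; the BSD versions shipped with OS X generally do not support this flag):

```shell
# print a short overview of ls and its options
# (on OS X, use man ls instead)
ls --help
```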

Checking the exact behavior of a command is especially important, given
that the shell will execute any command you type without asking whether
you know what you’re doing (so that it will promptly remove all of your files,

20 ● Chapter 1

if that’s the command you typed). You may be used to more forgiving (and
slightly patronizing) operating systems in which a pop-up window will warn
you whenever something you’re doing is considered dangerous. However,
don’t feel afraid to use the shell asmuch as possible. The really dangerous com-
mands are all very specific—there is very little chance that you will destroy
something accidentally simply by hitting the wrong key.

1.4.3 Navigating the Directory System

You can navigate the hierarchical Unix directory system using the following commands:

cd Change directory. The command requires one argument: the path to the
directory you want to change into. There are a few options for the command
that speed up navigation through the directory structure:

cd .. Move one directory up.
cd / Move to the root directory.
cd ∼ Move to your home directory.
cd - Go back to the directory you visited previously (like “Back” in a browser).


# assuming you saved CSB in your home directory
# navigate to the sandbox in the CSB/unix directory
cd ~/CSB/unix/sandbox

pwd Print the path of the current working directory. This command prints
your current location within the directory structure.

$ pwd


# this may look different depending on your system

ls List the files and subdirectories in the current directory. There are several
useful options:

ls -a List all files (including hidden files).


ls -l Return a long list with details on access permissions (see
section 1.6.7), the number of links to a file, the user and group
owning it, its size, the time and date it was changed last, and its name.

ls -lh Display file sizes in human readable units (K, M, G for kilobytes,
megabytes, gigabytes, respectively).

One can navigate through the directory hierarchy by providing either the
absolute path or a relative path. An example of the absolute path of a directory
is indicated at the bottom of figure 1.1. The full path of the file is
indicated, starting at the root. A relative path is defined with respect to the
current directory (use pwd to display the absolute path of your current working
directory). Let’s look at an example:

# absolute path to the directory CSB/python/data
# if CSB is in your home directory
$ cd ~/CSB/python/data

# relative path to navigate to CSB/unix/data
# remember: the Tab key provides autocomplete
$ cd ../../unix/data

# go back to previous directory (CSB/python/data)
$ cd -

You can always use either the absolute path or a relative path to specify
a directory. When you navigate just a few levels higher or deeper within the
tree, a relative path usually means less to type. If you want to jump somewhere
far within the tree, the absolute path might be the better choice.

Note that directory names in a path are separated with a forward slash (/)
in Unix but usually with a backslash (\) in Windows (e.g., when you look at
the path of a file using File Explorer). However, given that Git Bash emulates a
Unix environment, you will find it uses a forward slash despite working with
the Windows operating system.

In Unix, a full path name cannot have any spaces. Spaces in your file
or directory names need to be preceded by a backslash (\). For exam-
ple, the file “My Manuscript.txt” in the directory “Papers and reviews”
becomes Papers\ and\ reviews/My\ Manuscript.txt. To avoid such unruly
path names, an underscore (_) is recommended for separating elements in
the names of files and directories, rather than a space. If you need to refer to
an existing file or directory that has spaces in its name, use quotation marks
around it.


# does not work
cd Papers and reviews

# works but not optimal
cd Papers\ and\ reviews

# quotation marks also work
cd "Papers and reviews"

# when creating files or directories use
# underscores to separate elements in their names
cd Papers_and_reviews

Intermezzo 1.1
(a) Go to your home directory.
(b) Navigate to the sandbox directory within the CSB/unix directory.
(c) Use a relative path to go to the data directory within the python

(d) Use an absolute path to go to the sandbox directory within python.
(e) Return to the data directory within the python directory.

1.5 Basic Unix Commands

1.5.1 Handling Directories and Files

Creating, manipulating and deleting files or directories is one of the most
common tasks you will perform. Here are a few useful commands:

cp Copy a file or directory. The command requires two arguments: the first
argument is the file or directory that you want to copy, and the second is the
location to which you want to copy. In order to copy a directory, you need to
add the option -r which makes the command recursive. The directory and its
contents, including subdirectories (and their contents) will be copied.

# copy a file from unix/data directory into sandbox
# if you specify the full path,
# your current location does not matter
$ cp ~/CSB/unix/data/Buzzard2015_about.txt ~/CSB/unix/

� sandbox/

# assuming your current location is the unix sandbox,
# we can use a relative path
$ cp ../data/Buzzard2015_about.txt .

# the dot is shorthand to say "here"
# rename the file in the copying process
$ cp ../data/Buzzard2015_about.txt ./Buzzard2015_about2.txt

# copy a directory (including all subdirectories)
$ cp -r ../data .

mv Move or rename a file or directory. You can move a file by specifying
two arguments: the name of the file or directory you want to move, and the
destination. You can also use the mv command to rename a file or directory.
Simply specify the old and the new file name in the same location.

# move the file to the data directory
$ mv Buzzard2015_about2.txt ../data/

# rename a file
$ mv ../data/Buzzard2015_about2.txt ../data/

� Buzzard2015_about_new.txt

# easily manipulate a file that is
# not in your current working directory

touch Update the date of last access to the file. Interestingly, if the file does
not exist, this command will create an empty file.

# inspect the current contents of the directory
$ ls -l

# create a new file (you can list multiple files)
$ touch new_file.txt

# inspect the contents of the directory again
$ ls -l

# if you touch the file a second time,
# the time of last access will change

rm Remove a file. It has some useful options: rm -r deletes the contents of a
directory recursively (i.e., including all files and subdirectories in it). Use this
command with caution or in conjunction with the -i option, which prompts
the user to confirm the action. The option -f forcefully removes a write-
protected file (such as a directory under version control) without a prompt.


Again, use with caution as there is no trash bin that allows you to undo the removal.

$ rm -i new_file.txt

remove new_file.txt? y

# confirm deletion with y (yes) or n (no)

mkdir Make a directory. To create nested directories, use the option -p:

$ mkdir -p d1/d2/d3

# remove the directory by using command rm recursively
$ rm -r d1
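The commands above compose naturally into a short, self-contained session. All directory and file names below are hypothetical, chosen only for illustration:

```shell
# create a nested directory and an empty file (hypothetical names)
mkdir -p scratch_demo/sub
touch scratch_demo/notes.txt

# copy the file, then move the copy into the subdirectory
cp scratch_demo/notes.txt scratch_demo/notes_backup.txt
mv scratch_demo/notes_backup.txt scratch_demo/sub/

# list the subdirectory, then clean everything up recursively
ls scratch_demo/sub
rm -r scratch_demo
```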

1.5.2 Viewing and Processing Text Files

Unix was especially designed to handle text files, which is apparent when considering the multitude of commands dealing with text. Here are a few popular ones with a selection of useful options:3

less Progressively print a file to the screen. With this command you can
instantly take a look at very large files without the need to open them. In
fact, less does not load the entire file, but only what needs to be displayed—
making it much faster than a text editor. Once you enter the less environment, you have many options to navigate through the file or search for specific patterns. The simplest are Ctrl+F to jump one screen forward and Ctrl+B to jump one back. See more options by pressing h or have a look at the manual page. Pressing q quits the less environment.4

# assuming you're in CSB/unix/data
$ less Marra2014_data.fasta

>contig00001 length=527 numreads=2 gene=isogroup00001 status=it_thresh

3. We recommend skimming the manual pages of each command to get a sense of their full range of options.

4. Funny fact: there is a command called more that does the same thing, but with less
flexibility. Clearly, in Unix, less is more.




cat Concatenate and print files. The command requires at least one file
name as argument. If you provide only one, it will simply print the entire file
contents to the screen. Providing several files will concatenate the contents of
all files and print them to the screen.

# concatenate files and print to screen
$ cat Marra2014_about.txt Gesquiere2011_about.txt Buzzard2015_about.txt

wc Line, word, and byte (character) count of a file. The option -l returns the
line count only and is a quick way to get an idea of the size of a text file.

# count lines, words, and characters
$ wc Gesquiere2011_about.txt

8 64 447 Gesquiere2011_about.txt

# count lines only
$ wc -l Marra2014_about.txt

14 Marra2014_about.txt

sort Sort the lines of a file and print the result to the screen. Use option -n
for numerical sorting and -r to reverse the order. The option -k is useful to
sort a delimiter-separated file by a specific column (more on this command
in section 1.6.3).

# print the sorted lines of a file
$ sort Gesquiere2011_data.csv

100 102.56 163.06

100 117.05 158.01

100 133.4 94.78


# sort numerically
$ sort -n Gesquiere2011_data.csv


maleID GC T

1 32.65 59.94

1 51.09 35.57

1 52.72 43.98


uniq Show only the unique lines of a file. The contents need to be sorted first
for this to work properly. Section 1.6.2 describes how to combine commands.
The option -c returns a count of occurrences for each unique element in the file.
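Looking ahead to the pipes of section 1.6.1, here is a sketch of how wc, sort, and uniq combine, using a throwaway file whose contents are invented for the example:

```shell
# build a small throwaway file (made-up values)
printf "beta\nalpha\nbeta\ngamma\nbeta\n" > tmp_demo.txt

wc -l tmp_demo.txt             # 5 lines in total
sort tmp_demo.txt | uniq       # alpha, beta, gamma: duplicates collapse once sorted
sort tmp_demo.txt | uniq -c    # each unique line preceded by its count
rm tmp_demo.txt
```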

file Determine the type of a file. Useful to identify Windows-style line
terminators5 before opening a file.

$ file Marra2014_about.txt

Marra2014_about.txt: ASCII English text

head Print the head (i.e., first few lines) of a file. The option -n determines
the number of lines to print.

# display first two lines of a file
$ head -n 2 Gesquiere2011_data.csv

maleID GC T

1 66.9 64.57

tail Print the tail (i.e., last few lines) of a file. The option -n controls the
number of lines to print (starting from the end of the file). The option can
also be used to display everything but the first few lines.

# display last two lines of file
$ tail -n 2 Gesquiere2011_data.csv

127 108.08 152.61

127 114.09 151.07

5. Covered in section 1.9.2.


# display from line 2 onward
# (i.e., removing the header of the file)
$ tail -n +2 Gesquiere2011_data.csv

1 66.9 64.57

1 51.09 35.57

1 65.89 114.28


diff Show the differences between two files.

Intermezzo 1.2
To familiarize yourself with these basic Unix commands, try the following:

(a) Go to the data directory within CSB/unix.
(b) How many lines are in file Marra2014_data.fasta?
(c) Create the empty file toremove.txt in the CSB/unix/sandbox directory without leaving the current directory.
(d) List the contents of the directory unix/sandbox.
(e) Remove the file toremove.txt.

1.6 Advanced Unix Commands

1.6.1 Redirection and Pipes

So far, we have printed the output of each command (e.g., ls) directly to the
screen. However, it is easy to redirect the output to a file or to pipe the out-
put of one command as the input to another command. Stringing commands
together using pipes is the real power of Unix, letting you perform complex
tasks on large amounts of data using a single line of commands.

First, we show how to redirect the output of a command into a file:

$ [COMMAND] > filename

Note that if the file filename exists, it will be overwritten. If instead we want to
append the output of a command to an existing file, we can use the >> symbol,
as in the following line:


$ [COMMAND] >> filename

When the command is very long and complex, we might want to redirect the
contents of a file as input to a command, “reversing” the flow:

$ [COMMAND] < filename

To run a few examples, let’s start by moving to our sandbox:

$ cd ~/CSB/unix/sandbox

The command echo can be used to print a string on the screen. Instead
of printing to the screen, we redirect the output to a file, effectively creating a
file containing the string we want to print:

$ echo "My first line" > test.txt

We can see the result of our operation by printing the file to the screen
using the command cat:

$ cat test.txt

To append a second line to the file, we use >>:

$ echo "My second line" >> test.txt

$ cat test.txt

We can redirect the output of any command to a file.
Here is an example: Your collaborator or laboratory machine provided

you with a large number of data files. Before analyzing the data, you want to
get a sense of how many files need to be processed. If there are thousands of
files, you wouldn’t want to count them manually or even open a file browser
that could do the counting for you. It is much simpler and faster to type a few
Unix commands.


We will use unix/data/Saavedra2013 as an example of a directory with many files. First, we create a file that lists all the files contained in the directory:

# current directory is the unix sandbox
# create a file listing the contents of a directory
$ ls ../data/Saavedra2013 > filelist.txt

# look at the file
$ cat filelist.txt

Now we want to count how many lines are in the file. We can do so by
calling the command wc -l:6

# count lines in a file
$ wc -l filelist.txt

# remove the file
$ rm filelist.txt

However, we can skip the creation of the intermediate file (filelist.txt)
by creating a short pipeline. The pipe symbol (|) tells the shell to take the
output on the left of the pipe and use it as the input for the command on the
right of the pipe. To take the output of the command ls and use it as the input
of the command wc we can write

# count number of files in a directory
$ ls ../data/Saavedra2013 | wc -l

We have created our first, simple pipeline. In the following sections, we
are going to build increasingly long and complex pipelines. The idea is always
to start with a command and progressively add one piece after another to the
pipeline, each time checking that the result is the desired one.
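The same count can be reproduced on a directory we build ourselves; every name below is hypothetical:

```shell
# a directory with a known number of files
mkdir -p pipe_demo
touch pipe_demo/a.txt pipe_demo/b.txt pipe_demo/c.txt

# redirection version: write the listing to a file, then count its lines
ls pipe_demo > filelist.txt
wc -l < filelist.txt

# pipeline version: same count, no intermediate file needed
ls pipe_demo | wc -l

rm -r pipe_demo filelist.txt
```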

1.6.2 Selecting Columns Using cut

When dealing with tabular data, you will often encounter the comma-
separated values (CSV) standard file format. As the name implies, the data
are usually structured by commas, but you may find CSV files using other

6. This is the lowercase letter L as in line.


delimiters such as semicolons or tabs (e.g., because the data values contain
commas and spaces). The CSV format is text based and platform and software
independent, making it the standard output format for many experimental
devices. The versatility of the file format should also make it your preferred
choice when manually entering and storing data.7 Most of the exercises in this
book use the CSV file format in order to highlight how easy it is to read and
write these files using different programming languages.

The main Unix command you want to master for comma-, space-, tab-,
or character-delimited text files is cut. To showcase its features, we work with
data on the generation time of mammals published by Pacifici et al. (2013).
First, let’s make sure we are in the right directory (~/CSB/unix/data). Then,
we can print the header (the first line, specifying the contents of each column)
of the CSV file using the command head, which prints the first few lines of a
file on the screen, with the option -n 1, specifying that we want to output only
the first line:

# change directory
$ cd ~/CSB/unix/data

# display first line of file (i.e., header of CSV file)
$ head -n 1 Pacifici2013_data.csv


We can see that the data are separated by semicolons. We pipe the first
line of the file to cut and use the option -d ";" to specify the delimiter. The
additional option -f lets us extract specific columns: here column 1 (-f 1), or
the first four columns (-f 1-4).

# take first line, select 1st column of ";"-separated file
$ head -n 1 Pacifici2013_data.csv | cut -d ";" -f 1


$ head -n 1 Pacifici2013_data.csv | cut -d ";" -f 1-4


Remember to use the Tab key to autocomplete file names and the arrow
keys to access your command history.

7. If you need to store and process large data sets, you should consider databases, which we
explore in chapter 10, as an alternative.


In the next example we work with the contents of our data file. We specify
a delimiter, extract specific columns, and pipe the result to the head command
in order to display only the first few elements:

# select 2nd column, display first 5 elements
$ cut -d ";" -f 2 Pacifici2013_data.csv | head -n 5






# select 2nd and 8th columns, display first 3 elements
$ cut -d ";" -f 2,8 Pacifici2013_data.csv | head -n 3




Now, we specify the delimiter, extract the second column, skip the first line (the header) using the tail -n +2 command (i.e., return the whole file starting from the second line), and finally display the first five elements:

# select 2nd column without header, show 5 first elements
$ cut -d ";" -f 2 Pacifici2013_data.csv | tail -n +2 | head -n 5






We pipe the result of the previous command to the sort command (which sorts the lines), and then again to uniq (which collapses duplicate lines into a single occurrence).8 Effectively, we have created a pipeline to extract the names of all the orders in the database, from Afrosoricida to Tubulidentata (a remarkable order, which today contains only the aardvark).

8. The command uniq is typically used in conjunction with sort, as it will remove duplicate
lines only if they are contiguous.


# select 2nd column without header, unique sorted elements
$ cut -d ";" -f 2 Pacifici2013_data.csv | tail -n +2 | sort | uniq





This type of manipulation of character-delimited files is very fast and
effective. It is an excellent idea to master the cut command in order to start
exploring large data sets without the need to open files in specialized pro-
grams. (Note that opening a file in a text editor might modify the contents of
a file without your knowledge. Find details in section 1.9.2.)
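The same cut options can be tried on a tiny, self-made file (the values below are invented for illustration):

```shell
# a ";"-separated table with a header and two rows (made-up values)
printf "Order;Family;Mass\nRodentia;Muridae;25.4\nCarnivora;Felidae;3900\n" > cut_demo.csv

head -n 1 cut_demo.csv | cut -d ";" -f 1    # first column of the header: Order
cut -d ";" -f 2 cut_demo.csv | tail -n +2   # second column, header skipped
cut -d ";" -f 1-2 cut_demo.csv              # first two columns, all rows
rm cut_demo.csv
```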

Intermezzo 1.3
(a) If we order all species names (fifth column) of Pacifici2013_data.csv in alphabetical order, which is the first species? Which is the last?
(b) How many families are represented in the database?

1.6.3 Substituting Characters Using tr

Often we want to substitute or remove a specific character in a text file (e.g., to
convert a comma-separated file into a tab-separated file). Such a one-by-one
substitution can be accomplished with the command tr. Let’s look at some
examples in which we use a pipe to pass a string to tr, which then processes
the text input according to the search term and specific options.

Substitute all characters a with b:

$ echo "aaaabbb" | tr "a" "b"


Substitute every digit in the range 1 through 5 with 0:

$ echo "123456789" | tr 1-5 0



Substitute lowercase letters with uppercase ones:

$ echo "ACtGGcAaTT" | tr actg ACTG


We obtain the same result by using bracketed expressions that provide
a predefined set of characters. Here, we use the set of all lowercase letters
[:lower:] and translate into uppercase letters [:upper:]:

$ echo "ACtGGcAaTT" | tr [:lower:] [:upper:]


We can also indicate ranges of characters to substitute:

$ echo "aabbccddee" | tr a-c 1-3


Delete all occurrences of a:

$ echo "aaaaabbbb" | tr -d a


“Squeeze” all consecutive occurrences of a:

$ echo "aaaaabbbb" | tr -s a


Note that the command tr cannot operate on a file “in place,” meaning
that it cannot change a file directly. However, it can operate on a copy of the
contents of a file. For instance, we can use pipes in conjunction with cat, head,
cut, or the output redirection operator to create input for tr:

# pipe output of cat to tr
$ cat inputfile.csv | tr " " "\t" > outputfile.csv

# redirect file contents to tr
$ tr " " "\t" < inputfile.csv > outputfile.csv


In this example we replace all spaces within the file inputfile.csv with tabs.
Note the use of quotes to specify the space character. The tab is indicated by
\t. The backslash defines a metacharacter: it signals that the following char-
acter should not be interpreted literally, but rather represents a special code
referring to a character (e.g., a tab) that is difficult to represent otherwise.
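A concrete version of the space-to-tab substitution, with a file created on the spot (inputfile.csv in the text stands for any real file):

```shell
# a one-line, space-separated file created for the demonstration
printf "a b c\n" > tr_demo.txt

# replace every space with a tab; \t is the tab metacharacter
tr " " "\t" < tr_demo.txt > tr_demo.tsv
cat tr_demo.tsv
rm tr_demo.txt tr_demo.tsv
```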

Now we can apply the command tr and the commands we showcased
earlier to create a new file containing a subset of the data contained in
Pacifici2013_data.csv, which we are going to use in the next section.

First, we change directory to the sandbox:

$ cd ../sandbox/

To recap, we were working in the directory ~/CSB/unix/data. We then moved one directory up (..) to get to the directory ~/CSB/unix/, from which we moved down into the sandbox.

Now we want to create a version of Pacifici2013_data.csv contain-
ing only the Order, Family, Genus, Scientific_name, and AdultBodyMass_g
(columns 2–6). Moreover, we want to remove the header, sort the lines
according to body mass (with larger critters first), and have the values sep-
arated by spaces. This sounds like an awful lot of work, but we’re going to see
how this can be accomplished by piping a few commands together.

First, let’s remove the header:

$ tail -n +2 ../data/Pacifici2013_data.csv

Then, take only columns 2–6:

$ tail -n +2 ../data/Pacifici2013_data.csv | cut -d ";" -f 2-6

Now, substitute the current delimiter (;) with a space:

$ tail -n +2 ../data/Pacifici2013_data.csv | cut -d ";" -f 2-6 | tr ";" " "

To sort the lines according to body size, we need to exploit a few of the
options for the command sort. First, we want to sort numbers (option -n);


second, we want larger values first (option -r, reverse order); finally, we want
to sort the data according to the sixth column (option -k 6):

$ tail -n +2 ../data/Pacifici2013_data.csv | cut -d ";" -f 2-6 | tr ";" " " | sort -r -n -k 6

That’s it. We have created our first complex pipeline. To complete the
task, we redirect the output of our pipeline to a new file called BodyM.csv:

$ tail -n +2 ../data/Pacifici2013_data.csv | cut -d ";" -f 2-6 | tr ";" " " | sort -r -n -k 6 > BodyM.csv

You might object that the same operations could have been accomplished
with a few clicks by opening the file in a spreadsheet editor. However, suppose
you have to repeat this task many times; for example, you have to reformat
every file that is produced by a laboratory device. Then it is convenient to
automate this task such that it can be run with a single command. This is
exactly what we are going to do in section 1.7.

Similarly, suppose you need to download a large CSV file from a server,
but many of the columns are not needed. With cut, you can extract just the
relevant columns, reducing download time and storage.

1.6.4 Wildcards

Wildcards are special symbols that work as placeholders for one or more
characters. The star wildcard (*) stands for zero or more characters with the
exception of a leading dot. Unix uses a leading dot for hidden files, so this
means that hidden files are ignored in a search using this wildcard (show
hidden files using ls -a). A question mark (?) is a placeholder for any single
character, again with the exception of a leading dot.

Let’s look at some examples in the directory CSB/unix/data/miRNA:

# change into the directory
$ cd ~/CSB/unix/data/miRNA

# count the numbers of lines in all the .fasta files


$ wc -l *.fasta

714 ggo_miR.fasta

5176 hsa_miR.fasta

166 ppa_miR.fasta

1320 ppy_miR.fasta

1174 ptr_miR.fasta

20 ssy_miR.fasta

8570 total

# print the first two lines of each file
# whose name starts with pp
$ head -n 2 pp*

==> ppa_miR.fasta <==

>ppa-miR-15a MIMAT0002646


==> ppy_miR.fasta <==

>ppy-miR-569 MIMAT0016013


# determine the type of every file that has
# an extension with exactly three letters
$ file *.???
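The behavior of the two wildcards can be checked in a scratch directory (all names hypothetical):

```shell
mkdir -p wild_demo
touch wild_demo/a.fasta wild_demo/ab.fasta wild_demo/c.txt

ls wild_demo/*.fasta | wc -l   # 2: * matches any (possibly empty) stem
ls wild_demo/?.txt             # only c.txt: ? matches exactly one character
rm -r wild_demo
```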

1.6.5 Selecting Lines Using grep

grep is a powerful command that finds all the lines of a file that match a
given pattern. You can return or count all occurrences of the pattern in a
large text file without ever opening it. grep is based on the concept of regular
expressions, which we will cover in depth in chapter 5.

We explore the basic features of grep using the file we created in
section 1.6.3. The file contains data on thousands of species:

$ cd ~/CSB/unix/sandbox

$ wc -l BodyM.csv

5426 BodyM.csv

Let’s see how many wombats (family Vombatidae) are contained in the data.
To display the lines that contain the term “Vombatidae” we execute grep with
two arguments—the search term and the file that we want to search in:


$ grep Vombatidae BodyM.csv

Diprotodontia Vombatidae Lasiorhinus Lasiorhinus krefftii 31849.99

Diprotodontia Vombatidae Lasiorhinus Lasiorhinus latifrons 26163.8

Diprotodontia Vombatidae Vombatus Vombatus ursinus 26000

Now we add the option -c to count the lines that contain a match:

$ grep -c Vombatidae BodyM.csv

3


Next, we have a look at the genus Bos in the data file:

$ grep Bos BodyM.csv

Cetartiodactyla Bovidae Bos Bos sauveli 791321.8

Cetartiodactyla Bovidae Bos Bos gaurus 721000

Cetartiodactyla Bovidae Bos Bos mutus 650000

Cetartiodactyla Bovidae Bos Bos javanicus 635974.3

Cetartiodactyla Bovidae Boselaphus Boselaphus tragocamelus 182253

Besides all the members of the Bos genus, we also match one member of
the genus Boselaphus. To exclude it, we can use the option -w, which prompts
grep to match only full words:

$ grep -w Bos BodyM.csv

Cetartiodactyla Bovidae Bos Bos sauveli 791321.8

Cetartiodactyla Bovidae Bos Bos gaurus 721000

Cetartiodactyla Bovidae Bos Bos mutus 650000

Cetartiodactyla Bovidae Bos Bos javanicus 635974.3

Using the option -i we can make the search case insensitive (it will match both upper- and lowercase instances):

$ grep -i Bos BodyM.csv

Proboscidea Elephantidae Loxodonta Loxodonta africana 3824540


Proboscidea Elephantidae Elephas Elephas maximus 3269794

Cetartiodactyla Bovidae Bos Bos sauveli 791321.8

Cetartiodactyla Bovidae Bos Bos gaurus 721000


Sometimes, we want to know which lines precede or follow the one we
want to match. For example, suppose we want to know which mammals
have body weight most similar to the gorilla (Gorilla gorilla). The species are
already ordered by size (see section 1.6.3), thus we can simply print the two
lines before the match using the option -B 2 and the two lines after the match
using -A 2:

$ grep -B 2 -A 2 "Gorilla gorilla" BodyM.csv

Cetartiodactyla Bovidae Ovis Ovis ammon 113998.7

Cetartiodactyla Delphinidae Lissodelphis Lissodelphis borealis 113000

Primates Hominidae Gorilla Gorilla gorilla 112589

Cetartiodactyla Cervidae Blastocerus Blastocerus dichotomus 112518.5

Cetartiodactyla Iniidae Lipotes Lipotes vexillifer 112138.3

Use option -n to show the line number of the match. For example, the
gorilla is the 164th largest mammal in the database:

$ grep -n "Gorilla gorilla" BodyM.csv

164:Primates Hominidae Gorilla Gorilla gorilla 112589

To print all the lines that do not match a given pattern, use the option -v.
For instance, we want to find species of the genus Gorilla other than Gorilla
gorilla. We can pipe the result of matching all members of the genus Gorilla
to a second grep statement that excludes the species Gorilla gorilla:

$ grep Gorilla BodyM.csv | grep -v gorilla

Primates Hominidae Gorilla Gorilla beringei 149325.2

To match one of several strings, use grep "[STRING1]\|[STRING2]":


$ grep -w "Gorilla\|Pan" BodyM.csv

Primates Hominidae Gorilla Gorilla beringei 149325.2

Primates Hominidae Gorilla Gorilla gorilla 112589

Primates Hominidae Pan Pan troglodytes 45000

Primates Hominidae Pan Pan paniscus 35119.95

You can use grep on multiple files at a time! Simply list all the files that
you want to search (or use wildcards to specify multiple file names). Finally,
use the recursive search option -r to search for patterns within all the files in
a directory. For example,

$ cd ~/CSB/unix

# search recursively in the data directory
$ grep -r "Gorilla" data
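The main grep options can be verified on a file we write ourselves (the species below are just sample strings):

```shell
printf "Gorilla gorilla\nGorilla beringei\nPan paniscus\n" > grep_demo.txt

grep -c Gorilla grep_demo.txt                  # 2: count matching lines
grep -w gorilla grep_demo.txt                  # full-word match, lowercase only
grep Gorilla grep_demo.txt | grep -v gorilla   # Gorilla lines, minus "gorilla"
grep -w "beringei\|paniscus" grep_demo.txt     # match either string
rm grep_demo.txt
```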

1.6.6 Finding Files with find

The find command is the command-line program to locate files in your
system. You can search by file name, owner, group, type, and other crite-
ria. For example, find the files and subdirectories that are contained in the
unix/data directory:

# current directory is the unix sandbox
$ find ../data








To count occurrences, we can pipe to wc -l:

$ find ../data | wc -l



Now we can use find to match particular files. First, we specify where to search: this could be either an absolute path (e.g., /home/YOURNAME/CSB/unix/data) or a relative one (e.g., ../data, provided we're in unix/sandbox).
If we want to match a specific file name, we can use the option -name:

$ find ../data -name "n30.txt"


To exploit the full power of find, we use wildcards.9 For example, use the *
wildcard to find all the files whose names contain the word about; the option
-iname ignores the case of the file name:

$ find ../data -iname "*about*"




You can specify the depth of the search by limiting it to, for example,
only the directories immediately descending from the current one. See the
difference between

$ find ../data -name "*.txt" | wc -l

64 # depending on your system


$ find ../data -maxdepth 1 -name "*.txt" | wc -l


which excluded all files in subdirectories. You can exclude certain files:

$ find ../data -not -name "*about*" | wc -l


9. See section 1.6.4 for an introduction to wildcards.


or find only directories:

$ find ../data -type d
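The depth and type options are easy to confirm on a small, purpose-built tree (hypothetical names again):

```shell
mkdir -p find_demo/sub
touch find_demo/a.txt find_demo/sub/b.txt find_demo/sub/c.csv

find find_demo -name "*.txt"               # both .txt files, at any depth
find find_demo -maxdepth 1 -name "*.txt"   # only the top-level one
find find_demo -type d                     # the two directories
rm -r find_demo
```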




Intermezzo 1.4
(a) Navigate to CSB/unix/sandbox. Without navigating to a different location, find a CSV file that contains Dalziel in its file name and is located within the CSB directory. Copy this file to the Unix sandbox.

(b) Print the first few lines on the screen to check the structure of the data.
List all unique cities in column loc (omit the header). How often does
each city occur in the data set?

(c) The fourth column reports cases of measles. What is the maximum
number of cases reported for Washington, DC?

(d) What is the maximum number of reported measles cases in the entire
data set? Where did this occur?

1.6.7 Permissions

In Unix, each file and directory has specific security attributes specifying who
can read (r), write (w), execute (x), or do nothing (-) with the file or directory.
These permissions are specified for three entities that may wish to manip-
ulate the file (owner, specific group, and others). The group level is useful
for assigning permissions to a specific group of users (e.g., administrators,
developers) but not everyone else.

Typing ls -l lists the permissions of each file or subdirectory at the
beginning of the line. Each permission is represented by a 10-character nota-
tion. The first character refers to the file type and is not related to permissions
(- means file, d stands for directory). The last 9 are arranged in groups of 3 (triads) representing the ownership groups (owner, group, others). For example,
when a file has the permission -rwxr-xr--, the owner of this file can read,
write, and execute the file (rwx), the group can read and execute (r-x), while
everyone else can only read (r--).
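Whether a permission bit is set can also be checked programmatically with the shell's file-test operators, as in this sketch (file name hypothetical):

```shell
touch perm_demo.txt
chmod u=rwx,g=r,o=r perm_demo.txt

# -x tests whether the current user may execute the file
[ -x perm_demo.txt ] && echo "executable"

chmod u-x perm_demo.txt
[ -x perm_demo.txt ] || echo "execute bit removed"
rm perm_demo.txt
```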

The commands chmod and chown change the permissions and ownership
of a file, respectively:


# create a file in the unix sandbox
$ touch permissions.txt

# look at the current permissions
# (output will be different on your machine)
$ ls -l

-rw-r--r-- 1 mwilmes staff 0 Aug 15 09:47 permissions.txt

# change permissions (no spaces between mode settings)
$ chmod u=rwx,g=rx,o=r permissions.txt

# look at changes in permissions
$ ls -l

-rwxr-xr-- 1 mwilmes staff 0 Aug 15 09:48 permissions.txt

# take execute permission away from group,
# and add write rights to others
$ chmod g-x,o+w permissions.txt

$ ls -l

-rwxr--rw- 1 mwilmes staff 0 Aug 15 09:49 permissions.txt

Some operations, such as changing the ownership of a file or directory,
or installing new software, can be performed only by the administrator of the
machine. You, however, can do so by typing the word sudo (substitute user do) in front of the command. The system will request a password and, if you are authorized to use the sudo command, you grant yourself administrator rights for that command.
When you download and install new software, the system will often
request the administrator’s password. Pay attention to the trustworthiness
of the source of the software before confirming the installation, as you may
otherwise install malicious software.

Here is an example of changing the file permissions for a directory
recursively (i.e., for all subdirectories and files):

# create a directory with a subdirectory
$ mkdir -p test_dir/test_subdir

# look at permissions
$ ls -l

drwxr-xr-x 3 mwilmes staff 102 Aug 15 10:59 test_dir

# change owner of directory recursively using -R


$ sudo chown -R sallesina test_dir/

# check for new ownership
$ ls -l

drwxr-xr-x 3 sallesina staff 102 Aug 15 11:01 test_dir

1.7 Basic Scripting

Once a pipeline is in place, it is easy to turn it into a script. A script is a text
file containing a list of several commands. The commands are then executed
one after the other, going through the pipeline in an automated manner. To
illustrate the concept, we are going to turn the pipeline in section 1.6.3 into a script.

First, we need to create a file for our Unix script that we can edit using a text editor. The typical extension for a file with shell commands is .sh. In this example, we want to create the file ExtractBodyM.sh, which we can open using our favorite text editor. Create an empty file, either in the editor, or using touch:

$ touch ExtractBodyM.sh

Open the file in a text editor. In Ubuntu you can use, for example, gedit:

$ gedit ExtractBodyM.sh &

In OS X, calling open will open the file with the default text editor:10

$ open ExtractBodyM.sh &

The “ampersand” (&) at the end of the line prompts the terminal to open the editor in the background, so that you can still use the same shell while working on the file. Windows users can use any text editor.11

Now copy the pipeline that we built throughout the previous sections into the file ExtractBodyM.sh. For now, make sure that it is one long line:

10. Use option -a to choose a specific editor (e.g., open -a emacs &).
11. Make sure, however, that the editor can save files with the Unix line terminator (LF), otherwise the scripts will not work correctly (details in section 1.9.2).


tail -n +2 ../data/Pacifici2013_data.csv | cut -d ";" -f 2-6 | tr ";" " " | sort -r -n -k 6 > BodyM.csv

and save the file. To run the script, call the command bash followed by the file name:

$ bash ExtractBodyM.sh

It is a great idea to immediately write comments for the script, to help
you remember what the code does. You can add comments using the hash
symbol (#):

# take a CSV file delimited by ";"
# remove the header
# make space separated
# sort according to the 6th (numeric) column
# in descending order
# redirect to a file
tail -n +2 ../data/Pacifici2013_data.csv | cut -d ";" -f 2-6 | tr ";" " " | sort -r -n -k 6 > BodyM.csv

As it stands, this script is very specific: both the input file and the output
file names are fixed (hard coded). It would be better to leave these names to be
decided by the user so that the script can be called for any file with the same
format. This is easy to accomplish within the Bash shell: simply use generic
arguments (i.e., variables), indicated by the dollar sign ($), followed by the
variable name (without a space). Here, we use the number of the argument
as the variable name. When the script is run, the generic arguments within
the script are replaced by the specific argument that the user supplied when
executing the script.

Let’s change our script accordingly:

# take a CSV file delimited by ";" (first argument)
# remove the header
# make space separated
# sort according to the 6th (numeric) column
# in descending order
# redirect to a file (second argument)
tail -n +2 $1 | cut -d ";" -f 2-6 | tr ";" " " | sort -r -n -k 6 > $2

The file name (i.e., ../data/Pacifici2013_data.csv) and the result file (i.e., BodyM.csv) have been replaced by $1 and $2, respectively. Now you can launch the modified script from the command line by specifying the input and output files as arguments:

$ bash ExtractBodyM.sh ../data/Pacifici2013_data.csv BodyM.csv

The final step is to make the script directly executable so that you can skip invoking Bash. We can do so by changing the permissions of the file,

$ chmod +rx ExtractBodyM.sh

and adding a special line at the beginning of the script telling Unix where to
find the program (in this case bash12) to execute the script:


#!/bin/bash
# the previous line is not a comment, but a special line
# telling where to find the program to execute the script;
# it should be your first line in all Bash scripts

# function of script:
# take a CSV file delimited by ";" (first argument)
# remove the header
# make space separated

12. If you don’t know where the program bash is, you can find out by running whereis bash
in your terminal.


# sort according to the 6th (numeric) column
# in descending order
# redirect to a file (second argument)
tail -n +2 $1 | cut -d ";" -f 2-6 | tr ";" " " | sort -r -n -k 6 > $2

Now, this script can be invoked as

$ ./ExtractBodyM.sh ../data/Pacifici2013_data.csv BodyM.csv

Note the ./ in front of the script's name in order to execute the file.

The long Unix pipe that we constructed over the last few pages can be complicated to read and understand. It is therefore convenient to break it into smaller pieces and save the individual output of each part as a temporary file that can be deleted as a last step in the script:

#!/bin/bash
# function of script:
# take a CSV file delimited by ";" (first argument)
# remove the header
# make space separated
# sort according to the 6th (numeric) column
# in descending order
# redirect to a file (second argument)

# remove the header
tail -n +2 $1 > $1.tmp1
# extract columns
cut -d ";" -f 2-6 $1.tmp1 > $1.tmp2
# make space separated
tr ";" " " < $1.tmp2 > $1.tmp3
# sort and redirect to output
sort -r -n -k 6 $1.tmp3 > $2
# remove temporary, intermediate files
rm $1.tmp*

This is much more readable, although a little more wasteful, as it creates
temporary files only to delete them again. Using intermediate, temporary files,
however, allows scripts to be “debugged” easily—just comment the last line
out and inspect the temporary files one by one to investigate at which point
you obtained an unwanted result.
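The same pattern can be tried without the book's data set; the sketch below builds a made-up two-line CSV and processes it step by step. Commenting out the rm line would leave the .tmp files in place for inspection:

```shell
# create a tiny ";"-separated file with a header (made-up data)
printf 'name;size\nwhale;30\nmouse;0.1\n' > toy.csv

# remove the header
tail -n +2 toy.csv > toy.csv.tmp1
# make space separated
tr ";" " " < toy.csv.tmp1 > toy.csv.tmp2
# sort numerically by the 2nd column, in descending order
sort -r -n -k 2 toy.csv.tmp2 > toy_sorted.txt
# remove temporary, intermediate files
rm toy.csv.tmp*

cat toy_sorted.txt
# prints:
# whale 30
# mouse 0.1
```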

1.8 Simple for Loops

A for loop allows us to repeat a task with slight variations. For instance, a loop
is very useful when you need to perform an identical task on multiple files, or
when you want to provide different input arguments for the same command.
Instead of writing code for every instance separately, we can use a loop.

As a first example, we want to display the first two lines of all .fasta files in
the directory CSB/unix/data/miRNA. We first change the directory and execute
the ls command to list its contents:

$ cd ~/CSB/unix/data/miRNA

$ ls

ggo_miR.fasta hsa_miR.fasta ppa_miR.fasta ...

The directory contains six .fasta files with miRNA sequences of different
Hominidae species. Now we want to get a quick overview of the contents of
the files. Instead of individually calling the head command on each file, we
can access multiple files by writing a for loop:

$ for file in ggo_miR.fasta hsa_miR.fasta
do head -n 2 $file
done
>ggo-miR-31 MIMAT0002381
>hsa-miR-576-3p MIMAT0004796


Here we created a variable (file) that stands in for the actual file names that
are listed after the in. Instead of listing all files individually after the in, we
can also use wildcards to consider all .fasta files in the directory:

$ for file in *.fasta
do head -n 2 $file
done
>ggo-miR-31 MIMAT0002381
>hsa-miR-576-3p MIMAT0004796


The actual statement (i.e., what to do with the variable) is preceded by a do.
As shown in section 1.7, the variable is invoked with a $ (dollar sign). The
statement ends with done. Instead of this clear coding style that spans multiple
lines, you may also encounter loops written in one line, using a ; as command
terminator instead of line breaks.
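For example, an equivalent loop can be collapsed onto a single line (shown here with echo so that it runs without the miRNA files):

```shell
# the same for/do/done structure, written on one line
for x in a b c; do echo "item $x"; done
# prints:
# item a
# item b
# item c
```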

In our second example, we call a command with different input variables.
Currently, the files in CSB/unix/data/miRNA contain different miRNA
sequences per species. However, we might need files that contain all
sequences of different species per type of miRNA. We can accomplish this by
using the command grep in a for loop instead of writing code for every type
of miRNA separately:

$ for miR in miR-208a miR-564 miR-3170
do grep $miR -A1 *.fasta > $miR.fasta
done

We have created the variable miR that cycles through every item in the list that
is given after the in (i.e., types of miRNA). In every iteration of the loop, one
instance of the variable is handed to grep. We used the same variable again to
create appropriate file names.

Let’s have a look at the head of one of the files that we have created:

$ head -n 5 miR-564.fasta
hsa_miR.fasta:>hsa-miR-564 MIMAT0003228
ppy_miR.fasta:>ppy-miR-564 MIMAT0016009

We can see that the output of grep is the name of the original file where a
match was found, followed by the line that contained the match. The -A1
option of grep also returned the line after the match (i.e., the sequence).
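This behavior can be reproduced on miniature made-up FASTA files: when grep searches several files, each matching line is prefixed with the file name, and -A1 appends the line that follows the match (with GNU grep, match lines are marked with file:, context lines with file-, and groups of matches are separated by --):

```shell
# two miniature, made-up FASTA files
printf '>aaa-miR-1 X1\nACGU\n' > f1.fasta
printf '>bbb-miR-1 X2\nGGCC\n' > f2.fasta

# search both files; -A1 also prints the line after each match
grep "miR-1" -A1 f1.fasta f2.fasta
# prints:
# f1.fasta:>aaa-miR-1 X1
# f1.fasta-ACGU
# --
# f2.fasta:>bbb-miR-1 X2
# f2.fasta-GGCC
```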


Knowing how to perform such simple loops using the Bash shell is very
beneficial. However, Bash has a rather idiosyncratic syntax that does not lend
itself well to performing more complex programming tasks. We will therefore
cover general programming comprehensively in chapter 3, which introduces
a programming language with a friendlier syntax, Python.

1.9 Tips, Tricks, and Going beyond the Basics

1.9.1 Setting a PATH in .bash_profile

Have you come across the error message command not found? You may have
simply mistyped a command, or tried to invoke a program that is not installed
on your machine. Maybe, however, your computer doesn't know the location
of a program, in which case this can be resolved by adding the path (location)
of a program to the PATH variable. Your computer uses $PATH to search for
corresponding executable files when you invoke a command in the terminal.
To inspect your path variable, type

# print path variable to screen
$ echo $PATH

You can append a directory name (i.e., location of a program) to your
PATH by editing your .bash_profile. This file customizes your Bash shell (e.g.,
sets the colors of your terminal or changes the command-line prompt). If this
hidden file does not exist in your home directory (check with ls -a), you
can simply create it. Here is how to append a directory (shown with a
placeholder path) to your computer's PATH variable:

# add path to a program to computer's PATH variable
export PATH="$PATH:/path/to/program"
You can use which to identify the path to a program:

# identify the path to the grep command
$ which grep


Note that the order of elements in the PATH matters. If you have several
versions of a program installed on your machine, the one that is found first
(i.e., its location is represented earlier in the PATH) will be invoked.
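The effect of ordering can be demonstrated with two made-up directories that each contain a different program named hello (all names here are hypothetical):

```shell
# two directories, each with a different version of "hello"
mkdir -p dir_a dir_b
printf '#!/bin/bash\necho "version A"\n' > dir_a/hello
printf '#!/bin/bash\necho "version B"\n' > dir_b/hello
chmod +x dir_a/hello dir_b/hello

# dir_a precedes dir_b in the PATH, so its version is invoked
(export PATH="$PWD/dir_a:$PWD/dir_b:$PATH"; hello)
# prints: version A
```

Running which -a hello in such a session would list every match in PATH order, which is a quick way to spot shadowed versions of a program.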


1.9.2 Line Terminators

In text files, line terminators are represented by nonprinting characters. These
are special characters that indicate white space or special formatting (e.g.,
space, tab, line break, nonbreaking hyphen). Unless you explicitly ask your
editor to display them, they will not print to the screen (hence nonprint-
ing). Unfortunately, different platforms use different symbols to indicate line
breaks. While Unix-li