Project Templates
30 May 2017 | infrastructure dspg dspg17 project_template sdal
Project templates provide a standardized way to organize files. Our lab uses a template based on the Noble 2009 paper, “A Quick Guide to Organizing Computational Biology Projects”. I’ve created a simple shell script that automatically generates this folder structure here, and there’s an rr-init project by the Reproducible Science Curriculum folks.
The structure we have in our lab looks like this:
project
|
|- data             # raw and primary data, are not changed once created
|  |
|  |- project_data  # subfolder that links to an encrypted data storage container
|  |  |
|  |  |- original   # raw data, will not be altered
|  |  |- working    # intermediate datasets from src code
|  +  +- final      # datasets used in analysis
|
|- src /            # any programmatic code
|  |- user1         # user1 assigned to the project
|  +- user2         # user2 assigned to the project
|
|- output           # all output and results from workflows and analyses
|  |- figures/      # graphs, likely designated for manuscript figures
|  |- pictures/     # diagrams, images, and other non-graph graphics
|  +- analysis/     # generated reports for (e.g. rmarkdown output)
|
|- README.md        # the top level description of content
|
|- Makefile         # Makefile, if applicable
|- .gitignore       # git ignore file
+- project.Rproj    # RStudio project
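The layout above can be created with a handful of mkdir and touch calls. The following is a minimal sketch, not the script linked earlier: the function name make_project and the default folder name are made up for illustration, and project_data is created as a plain directory rather than as a link to an encrypted container.

```shell
# Sketch: create the project skeleton shown in the tree above.
# make_project is a hypothetical helper name, not the author's script.
make_project() {
    root="${1:-project}"

    # Data folders. In the lab's setup project_data links to an encrypted
    # container; here it is a plain directory for illustration only.
    mkdir -p "$root/data/project_data/original" \
             "$root/data/project_data/working" \
             "$root/data/project_data/final"

    # Per-user source folders are added later, under src/.
    mkdir -p "$root/src"

    # Placeholder folders for non-data output.
    mkdir -p "$root/output/figures" \
             "$root/output/pictures" \
             "$root/output/analysis"

    # Top-level files that every project starts with.
    touch "$root/README.md" "$root/.gitignore"
}
```

Calling make_project my_analysis would create my_analysis/ in the current directory. The Makefile and the Rproj file are left out because the first is optional and the second is normally created by RStudio.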
In the main level there are data, src, and output folders, as well as a README.md, a .gitignore, and potentially a Makefile.
Since we are primarily an R lab that runs an RStudio Server, we use Rproj files to organize the various projects.
There are a few benefits to this.
When a project is opened through its Rproj file, RStudio automatically sets the working directory to the location of the Rproj file.
This makes the project more reproducible by avoiding the setwd() command in R,
and since multiple people work on the same project, references to other people’s source code and data outputs
all stem from a common location.
The .gitignore file is there to ignore various outputs from the code.
This includes things like .html output from rendered reports,
as well as things in the first level of the output folder.
In general, the files and folders in the .gitignore file are things that can be reproduced/regenerated by running code from the src folder.
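As a sketch of what such a file might contain, the following writes a .gitignore that covers the cases mentioned in this post: rendered .html files and the contents of the output and data folders. The exact patterns and the helper name write_gitignore are assumptions, not a copy of the lab's actual file.

```shell
# Sketch: write a .gitignore for the template.
# write_gitignore is a hypothetical helper; the patterns are assumptions
# based on the description in the text, not the lab's actual file.
write_gitignore() {
    root="${1:-.}"
    cat > "$root/.gitignore" <<'EOF'
# rendered reports can be regenerated from their source documents
*.html

# everything below output/ can be re-created by running the code
output/*

# data never goes into the code repository
data/*
EOF
}
```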
The src folder contains all the analysis and code for the project.
It should contain only the code for the project and not any kind of output from the code (data, reports, etc.).
Since all the projects in the lab are separate git repositories,
each person working on the project creates a separate folder with their user name (e.g. user1, user2) under the
src directory to minimize
potential conflicts within the code.
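Creating the per-user folder takes a single command. A small sketch, assuming the folder name should default to the login name reported by id -un; add_user_src is a hypothetical helper name.

```shell
# Sketch: create a personal source folder under src/.
# add_user_src is a hypothetical helper name.
add_user_src() {
    project="${1:-.}"          # project root, defaults to the current directory
    user="${2:-$(id -un)}"     # folder name, defaults to the current login name
    mkdir -p "$project/src/$user"
}
```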
The output folder is there for any type of ‘final’ non-data output.
The usage is ambiguous on purpose,
but it is typically used for some kind of plot or table that will appear in a final publication or report.
The figures, pictures, and analysis subfolders are just placeholders for what could potentially be placed in the folder;
users have the freedom to adapt the contents to the project at hand.
Things in the output folder are, by default,
ignored by git, since they should be able to be re-created with the code in src.
The only things that should not be in the output folder are datasets.
Those should all be under one of the
data subfolders described below.
Since the data folder is part of the code repository (i.e., it comes when you
git clone the repository),
the contents of the folder are, by default, ignored in the .gitignore file.
Additionally, because of data privacy concerns, all of our project data are on separate, encrypted LUKS (Linux Unified Key Setup) containers.
The data folder contains a shortcut to the relevant encrypted data container.
This is one way to prevent data from being checked into the code repository
and potentially leaving the server.
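One way to set up such a shortcut is a symbolic link. The sketch below assumes the encrypted container is already unlocked and mounted somewhere on the server; the helper name link_project_data and any mount path passed to it are hypothetical, and the post does not say whether the lab uses symbolic links or another mechanism.

```shell
# Sketch: link an (already mounted) encrypted container into data/.
# link_project_data is a hypothetical helper; the mount point is an assumption.
link_project_data() {
    container="$1"   # directory where the unlocked container is mounted
    project="$2"     # project root

    mkdir -p "$project/data"
    # -s: symbolic link, -f: replace an existing link,
    # -n: treat an existing link to a directory as a file, do not descend into it
    ln -sfn "$container" "$project/data/project_data"
}
```

Because the link target lives outside the repository, the data stays on the encrypted volume even if someone removes the ignore rule by mistake; git would record only the link itself.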
Within the encrypted data folder, there are 3 main folders: original, working, and final.
The original data are the rawest datasets available.
Typically these are datasets we are given by sponsors
or found online.
These datasets, in combination with the code in
src, should be able to regenerate any of the datasets in working and final.
Data provenance is the chronology of how data is transformed through the cleaning and analysis phases.
It’s important for reproducibility/replicability, and means the
original data should never be altered directly.
Any change to the original data should be made only by the code in src, with the results saved elsewhere.
Also, because the
original dataset is never altered,
any bugs or alterations in the code can be fixed without compromising the integrity of the dataset.
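The post does not describe how the rule is enforced. One possible safeguard, offered here only as a suggestion, is to remove write permission from the files in original once they have been copied in. The helper name protect_original is hypothetical.

```shell
# Sketch: make every file under the original data folder read-only.
# protect_original is a hypothetical helper, not part of the lab's setup.
protect_original() {
    original_dir="$1"
    # Only files are changed, so new raw datasets can still be added
    # to the directory later.
    find "$original_dir" -type f -exec chmod a-w {} +
}
```

This guards against accidental edits only; the owner of a file can always restore write permission.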
The working data folder is mainly used for intermediate datasets.
For example, when a particular data step takes “a long time” to run,
the output of that data step can be saved in the working folder
and be used in a new
R script to resume any additional data cleaning steps.
The final data folder is usually used for datasets that have been cleaned and are ready for analysis.
No dataset is ever fully cleaned; you can probably always perform some other data transformation on it,
but this folder is mainly reserved for datasets from which an analysis, report, or plot is generated.
Project templates provide a standard for sharing code with other people. With a standardized folder structure, a new member of a project can quickly understand where the data, code, documentation, and results are.
It also makes code reproducible/replicable and provides a common location (working directory) to run the code.
Finally, because there are specifically designated areas for various components of a project, things become easier to find because everything is not simply placed in the same folder for “convenience”.