Perpetually Under Construction

Project Templates


Project templates provide a standardized way to organize files. Our lab uses a template based on the Noble (2009) paper, “A Quick Guide to Organizing Computational Biology Projects”. I’ve created a simple shell script that automatically generates this folder structure, and there’s also an rr-init project by the Reproducible Science Curriculum folks.

The structure we have in our lab looks like this:

project
|
|- data             # raw and primary data, not changed once created
|  |
|  |- project_data  # subfolder that links to an encrypted data storage container
|  |  |
|  |  |- original   # raw data, will not be altered
|  |  |- working    # intermediate datasets from src code
|  +  +- final      # datasets used in analysis
|
|- src/             # any programmatic code
|  |- user1         # user1 assigned to the project
|  +- user2         # user2 assigned to the project
|
|- output           # all output and results from workflows and analyses
|  |- figures/      # graphs, likely designated for manuscript figures
|  |- pictures/     # diagrams, images, and other non-graph graphics
|  +- analysis/     # generated reports (e.g. rmarkdown output)
|
|- README.md        # the top level description of content
|
|- Makefile         # Makefile, if applicable
|- .gitignore       # git ignore file
+- project.Rproj    # RStudio project

At the top level there are data, src, and output folders, as well as the *.Rproj and .gitignore files and, potentially, a Makefile.
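The layout above can be scaffolded with a few lines of shell. The following is a minimal sketch, not the lab's actual script: the function name make_project is made up for illustration, and data/project_data is deliberately left out because it is a link to an encrypted container rather than an ordinary folder. The project.Rproj file and the Makefile are left for RStudio and the user to create.

```shell
#!/bin/sh
# Hypothetical helper: create the folder skeleton for a new project.
make_project() {
    project="${1:?usage: make_project <project-dir>}"

    # Folders from the tree above (data/project_data is linked separately).
    mkdir -p "$project/data" \
             "$project/src" \
             "$project/output/figures" \
             "$project/output/pictures" \
             "$project/output/analysis"

    # Empty top-level files; anything that already exists is left untouched.
    for f in README.md .gitignore; do
        [ -e "$project/$f" ] || : > "$project/$f"
    done
}
```

Calling make_project my_analysis would create the folders under my_analysis/ without overwriting files that are already there.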

The .Rproj file

Since we are primarily an R lab that runs RStudio Server, we use .Rproj files to organize the various projects. There are a few benefits to this. Opening a project through its .Rproj file automatically sets the working directory in RStudio to the location of that file. This makes the project more reproducible by avoiding the setwd() command in R, and since multiple people work on the same project, references to other people’s source code and data outputs all stem from a common location.

The .gitignore file

The .gitignore file is there to ignore various outputs from the src code. This includes things like .pdf or .html output from knitr and rmarkdown documents, as well as things in the first level of the data folder. In general, the files and folders in the .gitignore file are things that can be reproduced/regenerated by running code from the src folder.
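As a rough illustration, a .gitignore along these lines could be written with a small shell function. The patterns below are assumptions inferred from the description in this post, not a copy of the lab's actual file, and write_gitignore is a hypothetical name.

```shell
# Hypothetical helper: write a starter .gitignore into a project folder.
write_gitignore() {
    project="${1:-.}"
    cat > "$project/.gitignore" <<'EOF'
# rendered knitr / rmarkdown output
*.pdf
*.html

# everything in the first level of the data folder
/data/*

# results that can be regenerated from src
/output/*
EOF
}
```

The leading slash anchors a pattern to the folder that holds the .gitignore, so /data/* matches only the top-level data folder and not a folder named data deeper in the tree.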

The src folder

The src folder contains all the analysis code for the project. It should contain only code, not any output from the code (data, reports, etc.). Since each project in the lab is a separate git repository, each person working on a project creates a separate folder named after their user name (e.g. user1, user2) under the src directory to minimize potential conflicts within the code.
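Creating the per-user folder is a one-liner; here is a small sketch. The function name add_user_dir is hypothetical, and defaulting to the current login name is an assumption about how user names map to folders.

```shell
# Hypothetical helper: create src/<user> inside a project.
add_user_dir() {
    project="${1:?usage: add_user_dir <project-dir> [user]}"
    # Fall back to the current login name when no user is given.
    user="${2:-$(id -un)}"
    mkdir -p "$project/src/$user"
}
```

For example, add_user_dir . user1 would create src/user1 in the current project.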

The output folder

The output folder is there for any type of ‘final’ non-data output. Its usage is ambiguous on purpose, but it is typically used for some kind of plot or table that will go into a final publication or report. The figures, pictures, and analysis subfolders are just placeholders for what could be placed in the folder; users have the freedom to adapt the contents to the project at hand.

Things in the output folder are, by default, ignored by git, since they should be able to be re-created with one of the src scripts.

The only things that should not be in the output folder are datasets. Those should all be under one of the data subfolders described below.

The data folder

Since the data folder is part of the code repository (i.e., it comes along when you git clone the repository), the contents of the folder are, by default, ignored in the .gitignore file. Additionally, because of data privacy concerns, all of our project data are on separate encrypted LUKS (Linux Unified Key Setup) partitions. The data folder contains a shortcut to the relevant encrypted data container. This is one way to prevent data from being checked into the code repository and potentially leaving the server.
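Assuming the "shortcut" is a symbolic link, wiring a project to its data container might look like the sketch below. The function name link_data is hypothetical, the container path is whatever mount point the encrypted partition uses on your server, and the container is assumed to be already unlocked and mounted.

```shell
# Hypothetical helper: link data/project_data to a mounted data container.
link_data() {
    project="${1:?usage: link_data <project-dir> <container-dir>}"
    container="${2:?usage: link_data <project-dir> <container-dir>}"

    # Refuse to create a dangling link if the container is not mounted.
    [ -d "$container" ] || { echo "not a directory: $container" >&2; return 1; }

    mkdir -p "$project/data"
    # -n replaces an existing link itself instead of descending into its target.
    ln -sfn "$container" "$project/data/project_data"
}
```

Pass the container as an absolute path; a relative path would be resolved relative to the data folder rather than to the directory the command was run from.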

Within the encrypted data folder, there are 3 main folders: original, working, and final. The original data are the rawest datasets available. Typically these are datasets we are given by sponsors, or found online. These datasets, in combination with the code in src, should be able to regenerate any of the datasets in working and final.

Data provenance is the chronology of how data is transformed through the cleaning and analysis phases. It’s important for reproducibility/replicability, and it means that original data should never be altered directly. Any change to the data should be made by the code in src, which reads the original files and writes new datasets. Also, because the original dataset is never altered, any bugs or alterations in the code can be fixed without compromising the integrity of the dataset.
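One optional way to guard against accidental edits is to strip write permission from the raw files once they are in place. This is only a sketch of that idea, not something the template itself does; lock_original is a hypothetical name, and the file owner or root can still restore write access.

```shell
# Hypothetical helper: make every file under the original data folder read-only.
lock_original() {
    original="${1:?usage: lock_original <original-data-dir>}"
    # Only regular files are changed, so new raw files can still be added
    # to the directories later.
    find "$original" -type f -exec chmod a-w {} +
}
```

For example, lock_original data/project_data/original would mark the raw datasets read-only after they have been copied in.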

The working data folder is mainly used for intermediate datasets. For example, when a particular data step takes “a long time” to run, the output of that data step can be saved in the working directory and be used in a new R script to resume any additional data cleaning steps.
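The same idea can be applied when driving scripts from the shell: run a slow step only if its saved output is missing. The sketch below is one way to do that; run_if_missing is a hypothetical helper, and the file and script names in the usage note are made up for illustration.

```shell
# Hypothetical helper: run a command only when its output file does not exist.
run_if_missing() {
    target="${1:?usage: run_if_missing <target-file> <command> [args...]}"
    shift
    if [ -e "$target" ]; then
        echo "skip: $target already exists"
    else
        # Run the remaining arguments as the command that produces the target.
        "$@"
    fi
}
```

For instance, run_if_missing data/project_data/working/cleaned.rds Rscript src/user1/01_clean.R would call the R script only when cleaned.rds has not been written yet. A Makefile rule expresses the same dependency and additionally reruns the step when the script changes.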

The final data folder is usually used for datasets that have been cleaned and are ready for analysis. No dataset is ever fully cleaned, since you can probably always perform some other data transformation on it, but this folder is mainly reserved for datasets from which an analysis, report, or plot is generated.

Conclusion

Project templates provide a standard way to share code with other people. With a standardized folder structure, a new member of a project can quickly understand where the data, code, documentation, and results are.

It also makes code reproducible/replicable and provides a common location (working directory) to run the code.

Finally, because there are specifically designated areas for various components of a project, things become easier to find because everything is not simply placed in the same folder for “convenience”.
