Welcome to grape.recipe.pipeline's documentation! ================================================= Grape (Grape RNA-Seq Analysis Pipeline Environment) is a pipeline for processing and analyzing RNA-Seq data developed at the Bioinformatics and Genomics unit of the `Centre for Genomic Regulation (CRG) `_ in Barcelona. The ``grape.buildout`` package makes use of the ``grape.recipe.pipeline`` recipe to configure Grape pipelines. You get preconfigured start and execute scripts, and don't have to worry about command line options any more. This makes configuring multiple Grape pipelines more convenient. To learn more about Grape, and to download and install it, go to the Bioinformatics and Genomics website at: `Grape Homepage `_ .. note:: The ``grape.recipe.pipeline`` package is a `Buildout `_ recipe used by ``grape.buildout``, and is not a standalone Python package. It is only going to be useful as installed by the ``grape.buildout`` package. Motivation ---------- Here at the CRG, we configure all our RNASeq pipeline runs in a central place before running them. Once all the accessions and pipeline profiles have been defined and the buildout parts have been created, we start and execute them on an `SGE `_ cluster. When we receive ``FASTQ`` or ``BAM`` files for a project, we typically have to: 1. Define the accessions and profiles:: grape.buildout/accessions/MyProject/db.cfg grape.buildout/profiles/MyProject/db.cfg 2. Create a pipeline project folder:: grape.buildout/pipelines/MyProject 3. Configure the buildout:: grape.buildout/pipelines/MyProject/buildout.cfg 4. Run the buildout in:: grape.buildout/pipelines/MyProject 5. Run the pipelines in:: grape.buildout/pipelines/MyProject/parts/*/ The ``grape.recipe.pipeline`` recipe plays a major role in step number 4. The buildout uses the recipe to produce the individual pipelines and preconfigure the start and execute scripts with all the necessary command line options. Accession Configuration Parameters ---------------------------------- The accession parameters are mostly derived from `UCSC's ENCODE controlled vocabulary `_. The following parameters are used when configuring accessions: ================= ===================================================== ``file_location`` The full path to the file If there is more than one file, put one file per line ================= ===================================================== For each line in ``file_location``, a corresponding line needs to be specified for the following parameters: ================= =================================================================== ``mate_id`` Using this label, files can be marked as belonging to one read ``pair_id`` Using this label, files can be associated with pairs ``label`` Using this label, files can be associated with the accession itself ================= =================================================================== The pairing information is required: ========== ========================================================== ``paired`` Set this to ``0`` for unpaired reads, an ``1`` for paired. ========== ========================================================== Instead of risking to wrongly deduce the file type from the file extensions contained in the ``file_location`` parameter, it should be given explicitly here. ============= ========================================================== ``type`` Set the file type. This can be set to ``fastq`` or ``bam`` ============= ========================================================== The qualities and the read type are important, because depending on them Grape may produce different results: ============= ======================================================================== ``qualities`` The base quality can be se to ``phred`` and ``solexa``, or if you don't know the quality, you can set it to ``ignore`` ``readType`` Paired/Single reads lengths: Specific information about cDNA sequence reads including length, directionality and single versus paired read. `See UCSC's ENCODE controlled vocabulary for readType `_. ============= ======================================================================== The ``replicate`` and ``species`` parameters don't change anything in the behaviour of Grape, but are considered essential meta data. ============= ========================================================================= ``replicate`` The biological replicate of a particular experiment, if the experiment is a bioreplicate ``species`` The species, for example ``Homo sapiens``, ``Mus musculus`` ============= ========================================================================= The following parameters have been used in the ENCODE project. It makes sense to fill in the ``cell`` parameters and the ``rnaExtract`` as well. For most projects, the ``localization`` is probably going to be 'cell'. ================ ============================================================================ ``cell`` Cell, tissue or DNA sample: Cell line or tissue used as the source of experimental material. `See UCSC's ENCODE controlled vocabulary for cell `_. ``rnaExtract`` RNA Extract: Fraction of total cellular RNA selected for by an experiment. This includes size fractionation (long versus short) and feature fractionation (PolyA-, PolyA+, rRNA-). `See UCSC's ENCODE controlled vocabulary for rnaExtract `_. ``localization`` Cellular compartment: The cellular compartment from which RNA is extracted. Primarily used by the Transcriptome Project. `See UCSC's ENCODE controlled vocabulary for localization `_. ================ ============================================================================ Example Accession Configuration ------------------------------- Here's a complete example of how the pipelines are configured, taken from the Test project in grape.buildout. First we define an accession in:: accession/Test/db.cfg This is the content of the db.cfg file:: [TestRun] species = Homo sapiens readType = 2x76 cell=NHEK rnaExtract=LONGPOLYA localization=CELL replicate=1 qualities=solexa type=fastq file_location = ${buildout:directory}/src/testdata/testA.r2.fastq.gz ${buildout:directory}/src/testdata/testA.r1.fastq.gz ${buildout:directory}/src/testdata/testB.r2.fastq.gz ${buildout:directory}/src/testdata/testB.r1.fastq.gz mate_id = testA.2 testA.1 testB.2 testB.1 pair_id = testA testA testB testB label = Test Test Test Test type = fastq paired = 1 Profiles Configuration Parameters --------------------------------- The following parameters are configured in the ``profiles`` folder, and specify the general parameters of the Grape pipeline. The project id should be as short as possible. ================================= =================== ``PROJECTID`` Name of the project ================================= =================== There are two predefined pipeline templates, one for fastq files as input and one for bam files. ================================= =================================================== ``TEMPLATE`` Path to the template defining the pipeline steps For ``FASTQ`` files as input: ${buildout:directory}/src/pipeline/template3.0.txt For ``BAM`` files as input: ${buildout:directory}/src/pipeline/template.bam.txt ================================= =================================================== There are some technical settings that need to be made so that the results are written to the right databases. ================================= =============================== ``DB`` Statistic results database name ``COMMONDB`` Meta data Database name ``HOST`` MySQL database host name ``CLUSTER`` Name of the cluster node to use ================================= =============================== You can fine-tune the number of threads to be used for any program that can make use of threads, like for example GEM. There is also a setting for the amount of memory to use for the Flux. ================================= =================================================== ``THREADS`` Number of threads to use ``FLUXMEM`` Configures the memory used by the Flux. The default value is ``16G`` ================================= =================================================== The mapper and the number of mismatches can be set. ================================= ================================================= ``MAPPER`` This currently has to be set to the value ``GEM`` ``MISMATCHES`` Number of mismatches for the mapper ================================= ================================================= The genome and annotation files need to be specified. ================================= =============== ``GENOMESEQ`` Genome file ``ANNOTATION`` Annotation file ================================= =============== Preprocessing the reads should be done on the fly. The most common preprocessing step is trimming, so there is one setting for the trim length. You can also specify your own preprocessing script. ================================= ====================================================== ``PREPROCESS_TRIM_LENGTH`` A preprocessing step that trims the reads by the given nucleotide length ``PREPROCESS`` Path to a custom script used for preprocessing each of the read files before anything else ================================= ====================================================== You can customize the way the recursive mapping is done, as well as how the postprocessing is done on some files. ===================================== ======================================================== ``MIN_RECURSIVE_MAPPING_TRIM_LENGTH`` Tunes the minimum length to which a read will be trimmed during the recursive mapping. ``MAXINTRONLENGTH`` Sets the maximum length of splits allowed during the postprocessing of the files generated by ``gem-2-sam`` removing the noise. The default is set to ``50k``, which is reasonable in mammals, however different species may require different settings. Setting it to ``0`` will remove this filter. ===================================== ======================================================== Example Profile Configuration ----------------------------- Then we need to define the pipeline runs in:: profiles/MyProject/db.cfg This is the content of the ``db.cfg`` file:: [runs] parts = TestRun [pipeline] TEMPLATE = ${buildout:directory}/src/pipeline/template3.0.txt PROJECTID = Test DB = Test_RNAseqPipeline COMMONDB = Test_RNAseqPipelineCommon THREADS = 8 MAPPER = GEM MISMATCHES = 2 CLUSTER = mem_6 ANNOTATION = ${buildout:directory}/src/testdata/H.sapiens.EnsEMBL.55.test.gtf GENOMESEQ = ${buildout:directory}/src/testdata/H.sapiens.genome.hg19.test.fa [TestRun] recipe=grape.recipe.pipeline accession = TestRun Now that we have the accessions and profiles defined, we can go to our project folder and define the ``buildout.cfg`` that will produce our Grape pipelines:: pipelines/Test/buildout.cfg The ``buildout.cfg`` should look like this:: [buildout] extends = ../dependencies.cfg ../../accessions/Test/db.cfg ../../profiles/Test/db.cfg There are pointers to the accession and profile. The dependencies file takes care of installing all the dependencies, like overlap, flux, gem, and the Grape pipeline. Contents: .. toctree:: :maxdepth: 2 Indices and tables ================== * :ref:`genindex` * :ref:`modindex` * :ref:`search`