Getting Started
Execution
The input files of IndeXeeker
IndeXeeker is usually run after Illumina’s CASAVA/bcl2fastq or an equivalent tool. For example, for the run 150820_D00257_0193_BC7NLTANXX:
configureBclToFastq.pl \
--use-bases-mask y*,y* \
--input-dir /path/to/HiSeq/output/dir/150820_D00257_0193_BC7NLTANXX/Data/Intensities/BaseCalls \
--output-dir /path/to/CASAVA/output/dir/
For bcl2fastq version 2 or above:
bcl2fastq --use-bases-mask y*,i* \
--create-fastq-for-index-reads \
--runfolder-dir /path/to/HiSeq/output/dir/150820_D00257_0193_BC7NLTANXX \
--output-dir /path/to/CASAVA/output/dir/
Here, i* in --use-bases-mask marks the read(s) that are defined as index reads on the sequencing machine. You can find this information in the RunInfo.xml file within the directory /path/to/HiSeq/output/dir/150820_D00257_0193_BC7NLTANXX.
With bcl2fastq version 1 only, configureBclToFastq.pl is then followed by:
make -C /path/to/CASAVA/output/dir/ -j <num-jobs>
The parameter --use-bases-mask y*,y* directs bcl2fastq to perform base calling only, leaving demultiplexing to IndeXeeker.
Note: If the flow cell is paired-end, use --use-bases-mask y*,y*,y* instead, or y*,i*,y* for bcl2fastq version 2 or above.
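The mask can also be derived from RunInfo.xml rather than written by hand. The following is a minimal sketch, not part of IndeXeeker: the function name bases_mask_from_runinfo is hypothetical, the embedded XML is an illustrative fragment, and it assumes the usual Illumina layout in which each Read element carries Number and IsIndexedRead attributes.

```python
import xml.etree.ElementTree as ET


def bases_mask_from_runinfo(runinfo_xml, bcl2fastq_v2=True):
    """Build a --use-bases-mask value from the text of a RunInfo.xml file.

    Hypothetical helper: assumes every <Read> element has the attributes
    Number and IsIndexedRead ("Y" or "N").
    """
    root = ET.fromstring(runinfo_xml)
    reads = sorted(root.iter("Read"), key=lambda r: int(r.get("Number")))
    parts = []
    for read in reads:
        is_index = read.get("IsIndexedRead") == "Y"
        # bcl2fastq 2 marks index reads with i*; for version 1 every read is y*
        parts.append("i*" if (is_index and bcl2fastq_v2) else "y*")
    return ",".join(parts)


# Illustrative fragment for a paired-end run with read lengths 101, 7, 101.
EXAMPLE = """
<RunInfo>
  <Run Id="150820_D00257_0193_BC7NLTANXX">
    <Reads>
      <Read Number="1" NumCycles="101" IsIndexedRead="N" />
      <Read Number="2" NumCycles="7" IsIndexedRead="Y" />
      <Read Number="3" NumCycles="101" IsIndexedRead="N" />
    </Reads>
  </Run>
</RunInfo>
"""

print(bases_mask_from_runinfo(EXAMPLE))                      # mask for bcl2fastq 2
print(bases_mask_from_runinfo(EXAMPLE, bcl2fastq_v2=False))  # mask for version 1
```

For a real run, pass the contents of the RunInfo.xml file from the run folder instead of EXAMPLE.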
bcl2fastq’s output then serves as input to IndeXeeker. IndeXeeker accepts uncompressed files, or compressed files in gz format as produced by bcl2fastq version 1. It does not currently support the bgzf format produced by bcl2fastq version 2, so you need to either run bcl2fastq with the --no-bgzf-compression parameter to get the output in gz format or, more efficiently, uncompress the bcl2fastq output files before running IndeXeeker.
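Uncompressing the bcl2fastq output can be scripted. The sketch below is one possible way to do it with the Python standard library; decompress_fastq and the file names are hypothetical. It relies on the assumption that bgzf output is a valid multi-member gzip stream, which Python's gzip module reads, so check the result on your own files.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def decompress_fastq(gz_path, out_dir):
    """Write an uncompressed copy of gz_path into out_dir and return its path."""
    gz_path = Path(gz_path)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # "sample.fastq.gz" -> "sample.fastq"
    out_path = out_dir / gz_path.with_suffix("").name
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # streams in chunks, so large files are fine
    return out_path


# Small self-contained demonstration on a temporary file (hypothetical name).
with tempfile.TemporaryDirectory() as tmp:
    compressed = Path(tmp) / "lane1_R1.fastq.gz"
    with gzip.open(compressed, "wb") as fh:
        fh.write(b"@read1\nACGT\n+\nIIII\n")
    plain = decompress_fastq(compressed, Path(tmp) / "uncompressed")
    print(plain.name, plain.stat().st_size, "bytes")
```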
Installing IndeXeeker provides two command-line scripts: indexeeker-demultiplex.py and indexeeker-prepare-barcodesheet.py.
To see all command-line parameters, run:
indexeeker-prepare-barcodesheet.py -h
indexeeker-demultiplex.py -h
BarcodeSheet
BarcodeSheet.csv is IndeXeeker’s equivalent of bcl2fastq’s SampleSheet.csv, but it is more elaborate so that more complex indexes can be described. As in bcl2fastq, the barcode sheet tells IndeXeeker how to assign reads to samples, and samples to projects. The script indexeeker-demultiplex.py takes the BarcodeSheet.csv file as input. You can prepare the BarcodeSheet.csv file manually, or use the script indexeeker-prepare-barcodesheet.py to prepare it by automatically detecting the barcode features and their locations on the reads.
The fields of BarcodeSheet.csv file:
Column | Description |
---|---|
lane | Positive integer indicating the lane number (1-8), as in bcl2fastq. Required |
project_name | The project the sample belongs to, as in bcl2fastq. Required |
sample_name | Sample name, as in bcl2fastq. Required |
sub_sample_name | Unique name for samples with identical sample name. The output files of these samples will be grouped into sub folders under the same folder with the sample_name. Optional. |
tag1_sequence | The first barcode sequence. If no barcode exists, write NoIndex. Required |
tag2_sequence | The second barcode sequence. Optional |
tag3_sequence | The third barcode sequence. Optional |
tag1_name | Tag name that will be used in the output file name. We usually use tag1_sequence as tag1_name. Required |
tag2_name | Same as tag1_name for second tag. Required when tag2_sequence is provided |
tag3_name | Same as tag1_name for third tag. Required when tag3_sequence is provided |
master_tag | Integer (1-3). When all sub samples under the same sample_name share a barcode, you can declare that common barcode as the master tag. Then, if the master barcode is identified but one of the other barcodes is not, the reads are assigned not to the general undetermined reads but to the local undetermined reads under the sample_name folder |
cut_tag1 | Whether to cut the barcode sequence from the sequences of read #1 (yes/no). Required |
cut_tag2 | Same as cut_tag1 for read #2 (yes/no). Required when tag2_sequence is provided |
cut_tag3 | Same as cut_tag1 for read #3 (yes/no). Required when tag3_sequence is provided |
maximal_mismatches_tag1 | Maximal number of mismatches in tag #1. Integer. Required |
maximal_mismatches_tag2 | Maximal number of mismatches in tag #2. Integer. Required when tag2_sequence is provided |
maximal_mismatches_tag3 | Maximal number of mismatches in tag #3. Integer. Required when tag3_sequence is provided |
maximal_offset_tagged_read | Allowed offset (to the left or to the right) of the first barcode on the read from its planned location. The format is (int)l(int)r. For example, 3l5r allows the barcode to be shifted by up to 3 bases toward the left side of the read (3’) and up to 5 bases toward the right side (5’). No offset is written as 0l0r. Required |
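To illustrate how the maximal_offset_tagged_read and maximal_mismatches fields could interact, here is a rough sketch. It is not IndeXeeker's implementation: parse_offset, mismatches and find_barcode are hypothetical names, "left" is taken to mean a lower position in the read string, and trying the positions nearest the planned location first is an assumption made for the example.

```python
import re

OFFSET_RE = re.compile(r"(\d+)l(\d+)r")


def parse_offset(value):
    """Parse a value such as '3l5r' into (left, right) base counts."""
    match = OFFSET_RE.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"expected <int>l<int>r, got {value!r}")
    return int(match.group(1)), int(match.group(2))


def mismatches(a, b):
    """Count the positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def find_barcode(read, barcode, planned_start, offset, max_mismatches):
    """Return the start index of the barcode in the read, or None.

    Only positions within the allowed offset of planned_start are tried,
    nearest first (an assumption of this sketch).
    """
    left, right = parse_offset(offset)
    candidates = range(planned_start - left, planned_start + right + 1)
    for start in sorted(candidates, key=lambda s: abs(s - planned_start)):
        if start < 0 or start + len(barcode) > len(read):
            continue  # window would fall outside the read
        window = read[start:start + len(barcode)]
        if mismatches(window, barcode) <= max_mismatches:
            return start
    return None


read = "ACGTTAGGCATCCA"  # made-up read; the barcode sits one base right of plan
print(find_barcode(read, "TAGGCAT", 3, "0l0r", 1))  # no offset allowed
print(find_barcode(read, "TAGGCAT", 3, "3l5r", 1))  # offset allowed
```

With 0l0r the barcode is missed because it is shifted by one base, while 3l5r finds it.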
If you choose to create the BarcodeSheet.csv file automatically with the indexeeker-prepare-barcodesheet.py script, you need to create a BarcodeList.csv file that contains only the following fields:
Column | Description |
---|---|
lane | Positive integer indicating the lane number (1-8), as in bcl2fastq. Required |
sample_name | Sample name, as in bcl2fastq. Required |
tag1_sequence | The first barcode sequence. If no barcode exists, write NoIndex. Required |
tag2_sequence | The second barcode sequence. Optional |
tag3_sequence | The third barcode sequence. Optional |
project_name | The project the sample belongs to, as in bcl2fastq. Required |
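A BarcodeList.csv file with these fields can be generated with a short script. The sketch below is only an illustration: write_barcode_list is a hypothetical helper, the sample, barcode and project values are made up, and both the header row and the column order are assumptions that should be checked against the BarcodeList example.

```python
import csv
import io

FIELDS = ["lane", "sample_name", "tag1_sequence",
          "tag2_sequence", "tag3_sequence", "project_name"]
REQUIRED = ["lane", "sample_name", "tag1_sequence", "project_name"]


def write_barcode_list(rows, handle):
    """Write rows (dicts keyed by column name) as CSV to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()  # assumption: the file starts with a header row
    for row in rows:
        missing = [name for name in REQUIRED if not row.get(name)]
        if missing:
            raise ValueError(f"missing required fields: {missing}")
        # Optional tags that are not given are written as empty cells.
        writer.writerow({name: row.get(name, "") for name in FIELDS})


# Made-up samples, each represented by 2 barcodes.
rows = [
    {"lane": 1, "sample_name": "sampleA", "tag1_sequence": "ACGTAC",
     "tag2_sequence": "TTGCA", "project_name": "projectX"},
    {"lane": 1, "sample_name": "sampleB", "tag1_sequence": "GGATCC",
     "tag2_sequence": "CAGTT", "project_name": "projectX"},
]

buffer = io.StringIO()
write_barcode_list(rows, buffer)
print(buffer.getvalue())
```

To write a real file, open it with open(path, "w", newline="") and pass the handle instead of the StringIO buffer.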
BarcodeList example
Here is a BarcodeList example for samples that are represented by 2 barcodes:
BarcodeList.csv for samples which are represented by 2 barcodes
BarcodeSheet examples
Here are a couple of barcode sheet examples - one for single-read and one for paired-end:
BarcodeSheet.csv for Single-read with read lengths 51, 7
BarcodeSheet.csv for Paired-end with read lengths 101, 7, 101
Usage examples
Here is an example execution of indexeeker-prepare-barcodesheet.py:
indexeeker-prepare-barcodesheet.py --casava-output-dir /path/to/bcl2fastq/output/dir/ \
--barcode-list /path/to/BarcodeList.csv \
--barcode-sheet-output /path/to/BarcodeSheet.csv \
--lanes 1
Here is an example execution of indexeeker-demultiplex.py:
indexeeker-demultiplex.py --casava-output-dir /path/to/bcl2fastq/output/dir/ \
--barcode-sheet /path/to/BarcodeSheet.csv \
--output-dir /path/to/IndeXeeker/output/dir/ \
--lanes 1
Concurrency
Since running IndeXeeker on a single lane may take up a lot of memory, it is possible to use it in the following manner:
1. Run 8 processes of CASAVA/bcl2fastq simultaneously, one per lane. Use the --tiles parameter to restrict each process to a single lane (for example, s_3_* runs bcl2fastq on lane 3), and give each process a different output directory:
configureBclToFastq.pl --use-bases-mask y*,y*,y* \
--input-dir /path/to/HiSeq/output/dir/150719_D00257_0190_BC6FFPANXX/Data/Intensities/BaseCalls/ \
--output-dir /path/to/CASAVA/output/dir/lane_1/ \
--tiles s_1_*
configureBclToFastq.pl --use-bases-mask y*,y*,y* \
--input-dir /path/to/HiSeq/output/dir/150719_D00257_0190_BC6FFPANXX/Data/Intensities/BaseCalls/ \
--output-dir /path/to/CASAVA/output/dir/lane_2/ \
--tiles s_2_*
[...]
configureBclToFastq.pl --use-bases-mask y*,y*,y* \
--input-dir /path/to/HiSeq/output/dir/150719_D00257_0190_BC6FFPANXX/Data/Intensities/BaseCalls/ \
--output-dir /path/to/CASAVA/output/dir/lane_8/ \
--tiles s_8_*
2. Run 8 processes of IndeXeeker simultaneously, each using a different bcl2fastq output directory as its input directory.
Note: These processes can share an output directory, as shown below. Collisions are avoided because the names of the created FASTQ files contain the lane number:
indexeeker-demultiplex.py --barcode-sheet /path/to/BarcodeSheet.csv \
--casava-output-dir /path/to/CASAVA/output/dir/lane_1/ \
--output-dir /path/to/IndeXeeker/output/dir/ \
--lanes 1
indexeeker-demultiplex.py --barcode-sheet /path/to/BarcodeSheet.csv \
--casava-output-dir /path/to/CASAVA/output/dir/lane_2/ \
--output-dir /path/to/IndeXeeker/output/dir/ \
--lanes 2
[...]
indexeeker-demultiplex.py --barcode-sheet /path/to/BarcodeSheet.csv \
--casava-output-dir /path/to/CASAVA/output/dir/lane_8/ \
--output-dir /path/to/IndeXeeker/output/dir/ \
--lanes 8
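The eight IndeXeeker commands above can also be launched from a small driver script. The following is a sketch under stated assumptions: demultiplex_command and run_all_lanes are hypothetical helpers, the per-lane input directories follow the lane_N naming used above, and only the command-line parameters shown in this section are used. As written it prints the commands and starts nothing.

```python
import subprocess


def demultiplex_command(lane, barcode_sheet, casava_root, output_dir):
    """Build the indexeeker-demultiplex.py argument list for one lane."""
    return [
        "indexeeker-demultiplex.py",
        "--barcode-sheet", barcode_sheet,
        "--casava-output-dir", f"{casava_root}/lane_{lane}/",
        "--output-dir", output_dir,
        "--lanes", str(lane),
    ]


def run_all_lanes(lanes, barcode_sheet, casava_root, output_dir):
    """Start one process per lane, wait for all, return exit codes by lane."""
    lanes = list(lanes)
    procs = [
        subprocess.Popen(
            demultiplex_command(lane, barcode_sheet, casava_root, output_dir))
        for lane in lanes
    ]
    return {lane: proc.wait() for lane, proc in zip(lanes, procs)}


if __name__ == "__main__":
    # Dry run: print the 8 commands instead of executing them.
    for lane in range(1, 9):
        print(" ".join(demultiplex_command(
            lane,
            "/path/to/BarcodeSheet.csv",
            "/path/to/CASAVA/output/dir",
            "/path/to/IndeXeeker/output/dir/",
        )))
```

Calling run_all_lanes(range(1, 9), ...) with the same arguments would start all eight processes at once. A non-zero exit code in the returned dictionary marks a lane that failed.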