Getting Started

Execution

The input files of IndeXeeker

IndeXeeker is usually run after Illumina’s CASAVA/bcl2fastq or an equivalent tool. For example, for the run 150820_D00257_0193_BC7NLTANXX with bcl2fastq version 1:

configureBclToFastq.pl \
  --use-bases-mask y*,y* \
  --input-dir /path/to/HiSeq/output/dir/150820_D00257_0193_BC7NLTANXX/Data/Intensities/BaseCalls \
  --output-dir /path/to/CASAVA/output/dir/

For bcl2fastq version 2 or above:

bcl2fastq --use-bases-mask y*,i* \
  --create-fastq-for-index-reads \
  --runfolder-dir /path/to/HiSeq/output/dir/150820_D00257_0193_BC7NLTANXX \
  --output-dir /path/to/CASAVA/output/dir/

Here, i* in --use-bases-mask marks the read(s) defined as index reads on the sequencing machine. You can find this information in the RunInfo.xml file within the directory /path/to/HiSeq/output/dir/150820_D00257_0193_BC7NLTANXX.

With bcl2fastq version 1 only, the configuration step is then followed by:

make -C /path/to/CASAVA/output/dir/ -j <num-jobs>

The parameter --use-bases-mask y*,y* directs bcl2fastq to perform base calling only, leaving demultiplexing to IndeXeeker.

Note: If the flow cell is paired-end, use --use-bases-mask y*,y*,y* instead, or y*,i*,y* for bcl2fastq version 2 or above.
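The mask can also be derived from RunInfo.xml rather than written by hand. The following is a minimal Python sketch, not part of IndeXeeker; it assumes the usual RunInfo.xml layout in which each Read element carries Number and IsIndexedRead ("Y"/"N") attributes. The function name and its parameters are hypothetical.

```python
import xml.etree.ElementTree as ET


def bases_mask_from_runinfo(source, bcl2fastq_version=1):
    """Return a --use-bases-mask value for the reads listed in RunInfo.xml.

    source is a path to RunInfo.xml or an open file object.
    With bcl2fastq version 1 every read is masked as y*, so index reads are
    written out like ordinary reads. With version 2 or above, index reads
    are masked as i* (to be combined with --create-fastq-for-index-reads).
    """
    root = ET.parse(source).getroot()
    # Order the reads by their Number attribute (1, 2, 3, ...).
    reads = sorted(root.iter("Read"), key=lambda read: int(read.get("Number")))
    parts = []
    for read in reads:
        is_index = read.get("IsIndexedRead") == "Y"
        if is_index and bcl2fastq_version >= 2:
            parts.append("i*")
        else:
            parts.append("y*")
    return ",".join(parts)
```

For a paired-end run with one index read between the two sequence reads, this yields y*,y*,y* for version 1 and y*,i*,y* for version 2, matching the masks given above.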

bcl2fastq’s output then serves as the input to IndeXeeker. IndeXeeker accepts uncompressed files, or compressed files in gz format as produced by bcl2fastq version 1. It does not currently support the bgzf format produced by bcl2fastq version 2, so you need to either run bcl2fastq with the --no-bgzf-compression parameter to get the output in gz format or, more efficiently, uncompress the bcl2fastq output files before running IndeXeeker.
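Whether a given bcl2fastq output file is bgzf or plain gz can be checked from its first bytes. The following is a minimal Python sketch, not part of IndeXeeker. It relies on the BGZF layout described in the SAM/BAM specification (a gzip header with the FEXTRA flag set and an extra subfield tagged BC), and it assumes that subfield comes first in the header, which is how BGZF writers normally emit it. The function names are hypothetical.

```python
import gzip
import shutil


def looks_like_bgzf(header):
    """Return True if the given leading bytes look like a BGZF block header."""
    if len(header) < 18:
        return False
    # Bytes 0-3: gzip magic (1f 8b), deflate method (08), FEXTRA flag (04).
    has_gzip_extra = header[:4] == b"\x1f\x8b\x08\x04"
    # Bytes 12-15: extra subfield identifier "BC" with a 2-byte payload.
    has_bc_subfield = header[12:14] == b"BC" and header[14:16] == b"\x02\x00"
    return has_gzip_extra and has_bc_subfield


def is_bgzf_file(path):
    """Read the first 18 bytes of a file and test them for a BGZF header."""
    with open(path, "rb") as handle:
        return looks_like_bgzf(handle.read(18))


def decompress_to(src_path, dst_path):
    """Uncompress a gz (or bgzf) file; Python's gzip reads multi-member files."""
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
```

Files for which is_bgzf_file returns True would need to be uncompressed (for example with decompress_to) before being passed to IndeXeeker.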

Following IndeXeeker’s installation, two command-line scripts are installed named indexeeker-demultiplex.py and indexeeker-prepare-barcodesheet.py.

To see all command-line parameters, run:

indexeeker-prepare-barcodesheet.py -h
indexeeker-demultiplex.py -h

BarcodeSheet

BarcodeSheet.csv is IndeXeeker’s equivalent of bcl2fastq’s SampleSheet.csv, but it is more elaborate so that more complex indexes can be described. As in bcl2fastq, the barcode sheet tells IndeXeeker how to assign reads to samples, and samples to projects. The script indexeeker-demultiplex.py takes the BarcodeSheet.csv file as input. You can prepare the BarcodeSheet.csv file manually, or use the script indexeeker-prepare-barcodesheet.py, which prepares it by automatically detecting the barcodes’ features and their locations on the reads.

The fields of BarcodeSheet.csv file:

Column Description
lane Positive integer indicating the lane number (1-8), as in bcl2fastq. Required
project_name The project the sample belongs to, as in bcl2fastq. Required
sample_name Sample name, as in bcl2fastq. Required
sub_sample_name Unique name for samples with identical sample name. The output files of these samples will be grouped into sub folders under the same folder with the sample_name. Optional.
tag1_sequence The first barcode sequence. If no barcode exists, write NoIndex. Required
tag2_sequence The second barcode sequence. Optional
tag3_sequence The third barcode sequence. Optional
tag1_name Tag name, which will be used in the output file name. We usually use tag1_sequence as tag1_name. Required
tag2_name Same as tag1_name for the second tag. Required when tag2_sequence is provided
tag3_name Same as tag1_name for the third tag. Required when tag3_sequence is provided
master_tag Integer (1-3). When all sub samples under the same sample_name share a barcode, you can declare that common barcode as the master tag. If the master barcode is identified in a read but one of the other barcodes is not, the read is assigned to the local undetermined reads under the sample_name folder rather than to the general undetermined reads
cut_tag1 Whether to cut the barcode sequence from the sequences of read 1 (yes/no). Required
cut_tag2 Same as cut_tag1 for read #2 (yes/no). Required when tag2_sequence is provided
cut_tag3 Same as cut_tag1 for read #3 (yes/no). Required when tag3_sequence is provided
maximal_mismatches_tag1 Maximal number of mismatches in tag #1. Integer. Required
maximal_mismatches_tag2 Maximal number of mismatches in tag #2. Integer. Required when tag2_sequence is provided
maximal_mismatches_tag3 Maximal number of mismatches in tag #3. Integer. Required when tag3_sequence is provided
maximal_offset_tagged_read Allowed offset (to the left or to the right) of the first barcode from its planned location on the read. The format is (int)l(int)r. For example, 3l5r allows the barcode to be shifted up to 3 bases toward the left side of the read (3’) and up to 5 bases toward the right side (5’). No offset is written as 0l0r. Required
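The field rules above can be checked before a run. The following is a minimal Python sketch, not IndeXeeker’s own validation: it parses the (int)l(int)r offset format and lists the fields missing from one row, assuming the row is available as a dictionary keyed by the column names in the table. The function names are hypothetical.

```python
import re

# Matches the (int)l(int)r format, e.g. "3l5r" or "0l0r".
_OFFSET_RE = re.compile(r"^(\d+)l(\d+)r$")

ALWAYS_REQUIRED = [
    "lane", "project_name", "sample_name", "tag1_sequence", "tag1_name",
    "cut_tag1", "maximal_mismatches_tag1", "maximal_offset_tagged_read",
]


def parse_offset(value):
    """Return (left, right) base offsets from a string such as '3l5r'."""
    match = _OFFSET_RE.match(value.strip())
    if match is None:
        raise ValueError(f"expected (int)l(int)r, got {value!r}")
    return int(match.group(1)), int(match.group(2))


def missing_fields(row):
    """List required BarcodeSheet fields that are absent or empty in a row."""
    required = list(ALWAYS_REQUIRED)
    # Fields for tags 2 and 3 become required once their sequence is given.
    for n in (2, 3):
        if row.get(f"tag{n}_sequence"):
            required += [f"tag{n}_name", f"cut_tag{n}", f"maximal_mismatches_tag{n}"]
    return [name for name in required if not row.get(name)]
```

For example, parse_offset("3l5r") returns (3, 5), and a row that sets tag2_sequence without tag2_name is reported as missing tag2_name.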

If you choose to create the BarcodeSheet.csv file automatically with the indexeeker-prepare-barcodesheet.py script, you need to create a BarcodeList.csv file that contains only the following fields:

Column Description
lane Positive integer indicating the lane number (1-8), as in bcl2fastq. Required
sample_name Sample name, as in bcl2fastq. Required
tag1_sequence The first barcode sequence. If no barcode exists, write NoIndex. Required
tag2_sequence The second barcode sequence. Optional
tag3_sequence The third barcode sequence. Optional
project_name The project the sample belongs to, as in bcl2fastq. Required
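A file with these fields can be generated programmatically. The following is a minimal Python sketch under stated assumptions: that the file is comma-separated, that it starts with a header row holding the column names, and that the columns appear in the order listed above. None of this is confirmed here, so check it against the BarcodeList example. The function name and all sample values are hypothetical.

```python
import csv
import io

# Column names as listed in the table above; the order is an assumption.
BARCODE_LIST_FIELDS = [
    "lane", "sample_name", "tag1_sequence",
    "tag2_sequence", "tag3_sequence", "project_name",
]


def barcode_list_csv(samples):
    """Render a list of sample dictionaries as BarcodeList-style CSV text."""
    buffer = io.StringIO()
    # restval leaves optional fields (e.g. tag3_sequence) empty when absent.
    writer = csv.DictWriter(
        buffer, fieldnames=BARCODE_LIST_FIELDS, restval="", lineterminator="\n"
    )
    writer.writeheader()
    for sample in samples:
        writer.writerow(sample)
    return buffer.getvalue()


# Hypothetical sample represented by 2 barcodes.
example = barcode_list_csv([
    {"lane": 1, "sample_name": "sampleA", "tag1_sequence": "ACGTACG",
     "tag2_sequence": "TTGCA", "project_name": "demo"},
])
```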

BarcodeList example

Here is a BarcodeList example for samples that are represented by 2 barcodes:

BarcodeList.csv for samples which are represented by 2 barcodes

BarcodeSheet examples

Here are a couple of barcode sheet examples - one for single-read and one for paired-end:

BarcodeSheet.csv for Single-read with read lengths 51, 7

BarcodeSheet.csv for Paired-end with read lengths 101, 7, 101

Usage examples

Here is an example execution of indexeeker-prepare-barcodesheet.py:

indexeeker-prepare-barcodesheet.py --casava-output-dir /path/to/bcl2fastq/output/dir/ \
    --barcode-list /path/to/BarcodeList.csv \
    --barcode-sheet-output /path/to/BarcodeSheet.csv \
    --lanes 1

Here is an example execution of indexeeker-demultiplex.py:

indexeeker-demultiplex.py --casava-output-dir  /path/to/bcl2fastq/output/dir/ \
    --barcode-sheet /path/to/BarcodeSheet.csv \
    --output-dir /path/to/IndeXeeker/output/dir/ \
    --lanes 1

Concurrency

Since running IndeXeeker on a single lane may take up a lot of memory, the lanes can be handled by separate processes in the following manner:

1. Run 8 processes of CASAVA/bcl2fastq simultaneously, one per lane. Use the --tiles parameter to run each process on a different lane, for example s_3_* to run bcl2fastq on lane 3. Give each process a different output directory:

configureBclToFastq.pl --use-bases-mask y*,y*,y* \
    --input-dir /path/to/HiSeq/output/dir/150719_D00257_0190_BC6FFPANXX/Data/Intensities/BaseCalls/ \
    --output-dir /path/to/CASAVA/output/dir/lane_1/ \
    --tiles s_1_*
configureBclToFastq.pl --use-bases-mask y*,y*,y* \
    --input-dir /path/to/HiSeq/output/dir/150719_D00257_0190_BC6FFPANXX/Data/Intensities/BaseCalls/ \
    --output-dir /path/to/CASAVA/output/dir/lane_2/ \
    --tiles s_2_*
[...]
configureBclToFastq.pl --use-bases-mask y*,y*,y* \
    --input-dir /path/to/HiSeq/output/dir/150719_D00257_0190_BC6FFPANXX/Data/Intensities/BaseCalls/ \
    --output-dir /path/to/CASAVA/output/dir/lane_8/ \
    --tiles s_8_*

2. Run 8 processes of IndeXeeker simultaneously, each using a different bcl2fastq output directory as its input directory.

Note: These processes can share an output directory, as shown below. Collisions are avoided because the created FASTQ files contain the lane number:

indexeeker-demultiplex.py --barcode-sheet /path/to/BarcodeSheet.csv \
    --casava-output-dir /path/to/CASAVA/output/dir/lane_1/ \
    --output-dir /path/to/IndeXeeker/output/dir/ \
    --lanes 1
indexeeker-demultiplex.py --barcode-sheet /path/to/BarcodeSheet.csv \
    --casava-output-dir /path/to/CASAVA/output/dir/lane_2/ \
    --output-dir /path/to/IndeXeeker/output/dir/ \
    --lanes 2
[...]
indexeeker-demultiplex.py --barcode-sheet /path/to/BarcodeSheet.csv \
    --casava-output-dir /path/to/CASAVA/output/dir/lane_8/ \
    --output-dir /path/to/IndeXeeker/output/dir/ \
    --lanes 8
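The per-lane commands can be generated and launched from a script instead of being typed out. The following is a minimal Python sketch, assuming the lane_N directory layout used above; it uses only the command-line parameters shown in this section, and the function names and the max_workers parameter are hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def demultiplex_commands(barcode_sheet, casava_root, output_dir, lanes=range(1, 9)):
    """Build one indexeeker-demultiplex.py command (argument list) per lane."""
    root = casava_root.rstrip("/")
    return [
        [
            "indexeeker-demultiplex.py",
            "--barcode-sheet", barcode_sheet,
            "--casava-output-dir", f"{root}/lane_{lane}/",
            "--output-dir", output_dir,
            "--lanes", str(lane),
        ]
        for lane in lanes
    ]


def run_concurrently(commands, max_workers=8):
    """Run each command as a separate process; return exit codes in order."""
    def run(command):
        return subprocess.run(command).returncode

    # Threads only wait on the child processes, so at most max_workers
    # commands run at the same time.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run, commands))
```

Lowering max_workers limits how many lanes are processed at once, which trades run time for a smaller peak memory footprint. A non-zero entry in the returned list identifies the lane whose process failed.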