Analyser la qualité avec fastQC#

Fast Quality Control (FastQC)

Propose un certain nombre de diagrammes pour évaluer la qualité du séquençage.

fastqc [-o output dir] [--(no)extract] [-f fastq|bam|sam] fq1 fq2 ...

# On remonte d’un niveau dans l’arborescence
cd /shared/projects/2325_ebaii/coursLinux/demo/chip-seq/                                   
?2004h

# On créé un répertoire
mkdir -p qc
# 2 instructions sur la même ligne séparées par ‘;’
ls -l ; cd qc
total 16
drwxrwx---+ 2 pfrancois pfrancois 4096 Nov  4 13:05 fastq
drwxrwx---+ 3 pfrancois pfrancois 4096 Nov  4 13:03 output
drwxrwx---+ 2 pfrancois pfrancois 4096 Nov  4 13:05 qc
drwxrwx---+ 2 pfrancois pfrancois 4096 Nov  4 13:03 ref
# Charge l'outil fastqc dans l’environnement  
module load fastqc/0.11.8 
# Obtenir de l’aide 
fastqc -h 
            FastQC - A high throughput sequence QC analysis tool

SYNOPSIS

	fastqc seqfile1 seqfile2 .. seqfileN

    fastqc [-o output dir] [--(no)extract] [-f fastq|bam|sam] 
           [-c contaminant file] seqfile1 .. seqfileN

DESCRIPTION

    FastQC reads a set of sequence files and produces from each one a quality
    control report consisting of a number of different modules, each one of 
    which will help to identify a different potential type of problem in your
    data.
    
    If no files to process are specified on the command line then the program
    will start as an interactive graphical application.  If files are provided
    on the command line then the program will run with no user interaction
    required.  In this mode it is suitable for inclusion into a standardised
    analysis pipeline.
    
    The options for the program as as follows:
    
    -h --help       Print this help file and exit
    
    -v --version    Print the version of the program and exit
    
    -o --outdir     Create all output files in the specified output directory.
                    Please note that this directory must exist as the program
                    will not create it.  If this option is not set then the 
                    output file for each sequence file is created in the same
                    directory as the sequence file which was processed.
                    
    --casava        Files come from raw casava output. Files in the same sample
                    group (differing only by the group number) will be analysed
                    as a set rather than individually. Sequences with the filter
                    flag set in the header will be excluded from the analysis.
                    Files must have the same names given to them by casava
                    (including being gzipped and ending with .gz) otherwise they
                    won't be grouped together correctly.
                    
    --nano          Files come from nanopore sequences and are in fast5 format. In
                    this mode you can pass in directories to process and the program
                    will take in all fast5 files within those directories and produce
                    a single output file from the sequences found in all files.                    
                    
    --nofilter      If running with --casava then don't remove read flagged by
                    casava as poor quality when performing the QC analysis.
                   
    --extract       If set then the zipped output file will be uncompressed in
                    the same directory after it has been created.  By default
                    this option will be set if fastqc is run in non-interactive
                    mode.
                    
    -j --java       Provides the full path to the java binary you want to use to
                    launch fastqc. If not supplied then java is assumed to be in
                    your path.
                   
    --noextract     Do not uncompress the output file after creating it.  You
                    should set this option if you do not wish to uncompress
                    the output when running in non-interactive mode.
                    
    --nogroup       Disable grouping of bases for reads >50bp. All reports will
                    show data for every base in the read.  WARNING: Using this
                    option will cause fastqc to crash and burn if you use it on
                    really long reads, and your plots may end up a ridiculous size.
                    You have been warned!
                    
    --min_length    Sets an artificial lower limit on the length of the sequence
                    to be shown in the report.  As long as you set this to a value
                    greater or equal to your longest read length then this will be
                    the sequence length used to create your read groups.  This can
                    be useful for making directly comaparable statistics from 
                    datasets with somewhat variable read lengths.
                    
    -f --format     Bypasses the normal sequence file format detection and
                    forces the program to use the specified format.  Valid
                    formats are bam,sam,bam_mapped,sam_mapped and fastq
                    
    -t --threads    Specifies the number of files which can be processed
                    simultaneously.  Each thread will be allocated 250MB of
                    memory so you shouldn't run more threads than your
                    available memory will cope with, and not more than
                    6 threads on a 32 bit machine
                  
    -c              Specifies a non-default file which contains the list of
    --contaminants  contaminants to screen overrepresented sequences against.
                    The file must contain sets of named contaminants in the
                    form name[tab]sequence.  Lines prefixed with a hash will
                    be ignored.

    -a              Specifies a non-default file which contains the list of
    --adapters      adapter sequences which will be explicity searched against
                    the library. The file must contain sets of named adapters
                    in the form name[tab]sequence.  Lines prefixed with a hash
                    will be ignored.
                    
    -l              Specifies a non-default file which contains a set of criteria
    --limits        which will be used to determine the warn/error limits for the
                    various modules.  This file can also be used to selectively 
                    remove some modules from the output all together.  The format
                    needs to mirror the default limits.txt file found in the
                    Configuration folder.
                    
   -k --kmers       Specifies the length of Kmer to look for in the Kmer content
                    module. Specified Kmer length must be between 2 and 10. Default
                    length is 7 if not specified.
                    
   -q --quiet       Supress all progress messages on stdout and only report errors.
   
   -d --dir         Selects a directory to be used for temporary files written when
                    generating report images. Defaults to system temp directory if
                    not specified.
                    
BUGS

    Any bugs in fastqc should be reported either to simon.andrews@babraham.ac.uk
    or in www.bioinformatics.babraham.ac.uk/bugzilla/
                   
    
# Lancer fastqc 
fastqc -f fastq -o ./ ../fastq/siNT_ER_E2_r3_chr21.fastq 2> siNT_ER_E2_r3_chr21_fastqc.log
Analysis complete for siNT_ER_E2_r3_chr21.fastq
less siNT_ER_E2_r3_chr21_fastqc.log  
# Que voyez vous ?
ls 

Jupyter Lab : accès au fichier html#

Côté gauche, avec l’onglet “répertoire” se placer à la racine du cluster

Sélectionner les répertoires jusqu’au répertoire de travail /shared/projects/<project>/chip-seq/qc (adapter <projet>)

Cliquer sur le fichier html pour l’ouvrir dans un nouvel onglet

Télécharger les résultats avec Cyberduck (OSX)#

Resultats#

Les résultats sont disponibles ici .

Interprétation#

Exploration des résultats de fastqc en interactif.

  • A quoi correspond le diagramme “Per base sequence quality”.

  • A quoi correspond le diagramme “Per sequence quality score” ?

  • A quoi correspond le diagramme “Per base sequence content” ?

  • A quoi correspond le diagramme “Per sequence GC content” ?

  • A quoi correspond le diagramme “Per sequence N content” ?

  • A quoi correspond le diagramme “Sequence length distribution” ?

  • A quoi correspond le diagramme “Sequence duplication level” ?

  • A quoi correspond le diagramme “Kmer content” ?

Basic Statistics#

The Basic Statistics module generates some simple composition statistics for the file analysed.

  • Filename

  • File type

  • Encoding: Says which ASCII encoding of quality values was found in this file.

  • Total Sequences: A count of the total number of sequences processed.

  • Filtered Sequences:

  • Sequence Length

Warning

Basic Statistics never raises a warning.

Error

Basic Statistics never raises an error.

Per Base Sequence Quality#

This view shows an overview of the range of quality values across all bases at each position in the FastQ file. For each position a BoxWhisker type plot is drawn. The elements of the plot are as follows: The central red line is the median value The yellow box represents the inter-quartile range (25-75%) The upper and lower whiskers represent the 10% and 90% points The blue line represents the mean quality The y-axis on the graph shows the quality scores. The higher the score the better the base call. The background of the graph divides the y axis into very good quality calls (green), calls of reasonable quality (orange), and calls of poor quality (red). The quality of calls on most platforms will degrade as the run progresses, so it is common to see base calls falling into the orange area towards the end of a read.

Warning

A warning will be issued if the lower quartile for any base is less than 10, or if the median for any base is less than 25.

Error

This module will raise a failure if the lower quartile for any base is less than 5 or if the median for any base is less than 20.

Per Tile Sequence Quality#

The plot shows the deviation from the average quality for each tile. The colours are on a cold to hot scale, with cold colours being positions where the quality was at or above the average for that base in the run, and hotter colours indicate that a tile had worse qualities than other tiles for that base. In the example below you can see that certain tiles show consistently poor quality. A good plot should be blue all over.

Warning

This module will issue a warning if any tile shows a mean Phred score more than 2 less than the mean for that base across all tiles.

Error

This module will issue a warning if any tile shows a mean Phred score more than 5 less than the mean for that base across all tiles.

Per Sequence Quality Scores#

The per sequence quality score report allows you to see if a subset of your sequences have universally low quality values. It is often the case that a subset of sequences will have universally poor quality, often because they are poorly imaged (on the edge of the field of view etc), however these should represent only a small percentage of the total sequences.

If a significant proportion of the sequences in a run have overall low quality then this could indicate some kind of systematic problem - possibly with just part of the run (for example one end of a flowcell).

Warning

A warning is raised if the most frequently observed mean quality is below 27 - this equates to a 0.2% error rate.

Error

An error is raised if the most frequently observed mean quality is below 20 - this equates to a 1% error rate.

Per Base Sequence Content#

Per Base Sequence Content plots out the proportion of each base position in a file for which each of the four normal DNA bases has been called.

In a random library you would expect that there would be little to no difference between the different bases of a sequence run, so the lines in this plot should run parallel with each other.

If you see strong biases which change in different bases then this usually indicates an overrepresented sequence which is contaminating your library. A bias which is consistent across all bases either indicates that the original library was sequence biased, or that there was a systematic problem during the sequencing of the library.

Warning

This module issues a warning if the difference between A and T, or G and C is greater than 10% in any position.

Error

This module will fail if the difference between A and T, or G and C is greater than 20% in any position.

Per Sequence GC Content#

This module measures the GC content across the whole length of each sequence in a file and compares it to a modelled normal distribution of GC content. In a normal random library you would expect to see a roughly normal distribution of GC content where the central peak corresponds to the overall GC content of the underlying genome. Since we don’t know the the GC content of the genome the modal GC content is calculated from the observed data and used to build a reference distribution. An unusually shaped distribution could indicate a contaminated library or some other kinds of biased subset. A normal distribution which is shifted indicates some systematic bias which is independent of base position. If there is a systematic bias which creates a shifted normal distribution then this won’t be flagged as an error by the module since it doesn’t know what your genome’s GC content should be.

Warning

A warning is raised if the sum of the deviations from the normal distribution represents more than 15% of the reads.

Error

This module will indicate a failure if the sum of the deviations from the normal distribution represents more than 30% of the reads.

Per Base N Content#

If a sequencer is unable to make a base call with sufficient confidence then it will normally substitute an N rather than a conventional base call This module plots out the percentage of base calls at each position for which an N was called.

It’s not unusual to see a very low proportion of Ns appearing in a sequence, especially nearer the end of a sequence. However, if this proportion rises above a few percent it suggests that the analysis pipeline was unable to interpret the data well enough to make valid base calls.

Warning

This module raises a warning if any position shows an N content of >5%.

Error

This module will raise an error if any position shows an N content of >20%.

Sequence Length Distribution#

Some high throughput sequencers generate sequence fragments of uniform length, but others can contain reads of wildly varying lengths. Even within uniform length libraries some pipelines will trim sequences to remove poor quality base calls from the end. This module generates a graph showing the distribution of fragment sizes in the file which was analysed. In many cases this will produce a simple graph showing a peak only at one size, but for variable length FastQ files this will show the relative amounts of each different size of sequence fragment.

Warning

This module will raise a warning if all sequences are not the same length.

Error

This module will raise an error if any of the sequences have zero length.

Duplicate Sequences#

In a diverse library most sequences will occur only once in the final set. A low level of duplication may indicate a very high level of coverage of the target sequence, but a high level of duplication is more likely to indicate some kind of enrichment bias (eg PCR over amplification).

This module counts the degree of duplication for every sequence in the set and creates a plot showing the relative number of sequences with different degrees of duplication.

Warning

This module will issue a warning if non-unique sequences make up more than 20% of the total.

Error

This module will issue a error if non-unique sequences make up more than 50% of the total.

Overrepresented Sequences#

A normal high-throughput library will contain a diverse set of sequences, with no individual sequence making up a tiny fraction of the whole. Finding that a single sequence is very overrepresented in the set either means that it is highly biologically significant, or indicates that the library is contaminated, or not as diverse as you expected. This module lists all of the sequence which make up more than 0.1% of the total. To conserve memory only sequences which appear in the first 200,000 sequences are tracked to the end of the file. It is therefore possible that a sequence which is overrepresented but doesn’t appear at the start of the file for some reason could be missed by this module.

Warning

This module will issue a warning if any sequence is found to represent more than 0.1% of the total.

Error

This module will issue an error if any sequence is found to represent more than 1% of the total.

Adapter Content#

The Kmer Content module will do a generic analysis of all of the Kmers in your library to find those which do not have even coverage through the length of your reads. This can find a number of different sources of bias in the library which can include the presence of read-through adapter sequences building up on the end of your sequences. You can however find that the presence of any overrepresented sequences in your library (such as adapter dimers) will cause the Kmer plot to be dominated by the Kmers these sequences contain, and that it’s not always easy to see if there are other biases present in which you might be interested.

Warning

This module will issue a warning if any sequence is present in more than 5% of all reads.

Error

This module will issue a warning if any sequence is present in more than 10% of all reads.