Pysam is a python module that makes it easy to read and manipulate mapped short read sequence data stored in SAM/BAM files. It is a lightweight wrapper of the samtools C-API.
To use the module to read a file in BAM format, open the file with Samfile:
import pysam
samfile = pysam.Samfile( "ex1.bam", "rb" )
Now the file is open you can iterate over all of the read mapping to a specified region using fetch(). Each iteration returns a AlignedRead object which represents a single read along with its fields and optional tags:
for alignedread in samfile.fetch('chr1', 100, 120):
print alignedread
samfile.close()
To give:
EAS56_57:6:190:289:82 0 99 <<<7<<<;<<<<<<<<8;;<7;4<;<;;;;;94<; 69 CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA 0 192 1
EAS56_57:6:190:289:82 0 99 <<<<<<;<<<<<<<<<<;<<;<<<<;8<6;9;;2; 137 AGGGGTGCAGAGCCGAGTCACGGGGTTGCCAGCAC 73 64 1
EAS51_64:3:190:727:308 0 102 <<<<<<<<<<<<<<<<<<<<<<<<<<<::<<<844 99 GGTGCAGAGCCGAGTCACGGGGTTGCCAGCACAGG 99 18 1
...
You can also write to a Samfile:
import pysam
samfile = pysam.Samfile("ex1.bam", "rb")
pairedreads = pysam.Samfile("allpaired.bam", "wb", template=samfile)
for read in samfile.fetch():
if read.is_paired:
pairedreads.write(read)
pairedreads.close()
samfile.close()
An alternative way of accessing the data in a SAM file is by iterating over each base of a specified region using the pileup() method. Each iteration returns a PileupColumn which represents all the reads in the SAM file that map to a single base in the reference sequence. The list of reads are represented as PileupRead objects in the PileupColumn.pileups property:
import pysam
samfile = pysam.Samfile("ex1.bam", "rb" )
for pileupcolumn in samfile.pileup( 'chr1', 100, 120):
print
print 'coverage at base %s = %s' % (pileupcolumn.pos , pileupcolumn.n)
for pileupread in pileupcolumn.pileups:
print '\tbase in read %s = %s' % (pileupread.alignment.qname, pileupread.alignment.seq[pileupread.qpos])
samfile.close()
To give:
coverage at base 99 = 1
base in read EAS56_57:6:190:289:82 = A
coverage at base 100 = 1
base in read EAS56_57:6:190:289:82 = G
coverage at base 101 = 1
base in read EAS56_57:6:190:289:82 = G
coverage at base 102 = 2
base in read EAS56_57:6:190:289:82 = G
base in read EAS51_64:3:190:727:308 = G
...
Commands available in csamtools are available as simple function calls. For example:
pysam.sort( "ex1.bam", "output" )
corresponds to the command line:
samtools sort ex1.bam output
More detailed usage instructions is at Usage.
Note
Coordinates in pysam are always 0-based (following the python convention). SAM text files use 1-based coordinates. The above examples can be run in the tests directory of the installation directory. Type ‘make’ before running them.
See also
(filename, mode=’r’, template = None, referencenames = None, referencelengths = None, text = NULL, header = None)
A SAM/BAM formatted file. The file is automatically opened.
mode should be r for reading or w for writing. The default is text mode so for binary (BAM) I/O you should append b for compressed or u for uncompressed BAM output. Use h to output header information in text (TAM) mode.
If b is present, it must immediately follow r or w. Valid modes are r, w, wh, rb, wb and wbu. For instance, to open a BAM formatted file for reading, type:
import pysam
f = pysam.Samfile('ex1.bam','rb')
If an index for a BAM file exists (.bai), it will be opened automatically. Without an index random access to reads via fetch() and pileup() is disabled.
For writing, the header of a SAM file/BAM file can be constituted from several sources (see also the samtools format specification):
- If template is given, the header is copied from a another Samfile (template must be of type Samfile).
- If header is given, the header is built from a multi-level dictionary. The first level are the four types (‘HD’, ‘SQ’, ...). The second level are a list of lines, with each line being a list of tag-value pairs.
- If text is given, new header text is copied from raw text.
- The names (referencenames) and lengths (referencelengths) are supplied directly as lists.
closes the pysam.Samfile.
(reference = None, start = None, end = None, region = None, callback = None, until_eof = False)
count reads region using 0-based indexing. The region is specified by reference, start and end. Alternatively, a samtools region string can be supplied.
Note that a TAM file does not allow random access. If region or reference are given, an exception is raised.
fetch aligned reads in a region using 0-based indexing. The region is specified by reference, start and end. Alternatively, a samtools region string can be supplied.
Without reference or region all reads will be fetched. The reads will be returned ordered by reference sequence, which will not necessarily be the order within the file. If until_eof is given, all reads from the current file position will be returned in order as they are within the file.
If only reference is set, all reads aligned to reference will be fetched.
The method returns an iterator of type pysam.IteratorRow unless a callback is provided. If *callback is given, the callback will be executed for each position within the region. Note that callbacks currently work only, if region or reference is given.
Note that a SAM file does not allow random access. If region or reference are given, an exception is raised.
number of filename associated with this object.
header information within the sam file. The records and fields are returned as a two-level dictionary.
tuple of the lengths of the reference sequences. The lengths are in the same order as pysam.Samfile.references
return the mate of AlignedRead read.
Throws a ValueError if read is unpaired or the mate is unmapped.
Note
Calling this method will change the file position. This might interfere with any iterators that have not re-opened the file.
x.next() -> the next value, or raise StopIteration
perform a pileup within a region. The region is specified by reference, start and end (using 0-based indexing). Alternatively, a samtools region string can be supplied.
Without reference or region all reads will be used for the pileup. The reads will be returned ordered by reference sequence, which will not necessarily be the order within the file.
The method returns an iterator of type pysam.IteratorColumn unless a callback is provided. If a *callback is given, the callback will be executed for each column within the region.
Note that SAM formatted files do not allow random access. In these files, if a region or reference are given an exception is raised.
Optional kwargs to the iterator:
The stepper controlls how the iterator advances. Possible options for the stepper are
A FastaFile object
Note
all reads which overlap the region are returned. The first base returned will be the first base of the first read not necessarily the first base of the region used in the query.
move file pointer to position offset, see pysam.Samfile.tell().
return current file position
write a single pysam.AlignedRead to disk.
returns the number of bytes written.
(filename)
A FASTA file. The file is automatically opened.
The file expects an indexed fasta file.
(reference = None, start = None, end = None, region = None)
fetch AlignedRead() objects in a region using 0-based indexing.
The region is specified by reference, start and end.
fetch returns an empty string if the region is out of range or addresses an unknown reference.
If reference is given and start is None, the sequence from the first base is returned. Similarly, if end is None, the sequence until the last base is returned.
Alternatively, a samtools region string can be supplied.
number of filename associated with this object.
abstract base class for iterators over mapped reads.
Various iterators implement different behaviours for wrapping around contig boundaries. Examples include:
The method Samfile.fetch() returns an IteratorRow.
abstract base class for iterators over columns.
IteratorColumn objects wrap the pileup functionality of samtools.
For reasons of efficiency, the iterator points to the current pileup buffer. The pileup buffer is updated at every iteration. This might cause some unexpected behavious. For example, consider the conversion to a list:
f = Samfile("file.bam", "rb")
result = list( f.pileup() )
Here, result will contain n objects of type PileupProxy for n columns, but each object in result will contain the same information.
The desired behaviour can be achieved by list comprehension:
result = [ x.pileups() for x in f.pileup() ]
result will be a list of n lists of objects of type PileupRead.
If the iterator is associated with a Fastafile using the addReference() method, then the iterator will export the current sequence via the methods getSequence() and seq_len().
Optional kwargs to the iterator
The stepper controlls how the iterator advances. Possible options for the stepper are
add reference sequences in fastafile to iterator.
return true if iterator is associated with a reference
current sequence length.
Class representing an aligned read. see SAM format specification for meaning of fields (http://samtools.sourceforge.net/).
This class stores a handle to the samtools C-structure representing an aligned read. Member read access is forwarded to the C-structure and converted into python objects. This implementation should be fast, as only the data needed is converted.
For write access, the C-structure is updated in-place. This is not the most efficient way to build BAM entries, as the variable length data is concatenated and thus needs to resized if a field is updated. Furthermore, the BAM entry might be in an inconsistent state. The validate() method can be used to check if an entry is consistent.
One issue to look out for is that the sequence should always be set before the quality scores. Setting the sequence will also erase any quality scores that were set previously.
aligned end position of the read (read only). Returns None if not available.
aligned length of the read (read only). Returns None if not available.
properties bin
return -1,0,1, if contents in this are binary <,=,> to other
returns list of fieldnames/values in pretty format for debugging
properties flag
true if optical or PCR duplicate
true if read is paired in sequencing
true if read is mapped in a proper pair
true if QC failure
true if this is read1
true if this is read2
true if read is mapped to reverse strand
true if not primary alignment
true if read itself is unmapped
the insert size
mapping quality
true is read is mapped to reverse strand
true if the mate is unmapped
the position of the mate
retrieves optional data given a two-letter tag
0-based leftmost coordinate
end index of the aligned query portion of the sequence (0-based, exclusive)
Length of the aligned query sequence
the query name (None if not present)
aligned query sequence quality values (None if not present)
start index of the aligned query portion of the sequence (0-based, inclusive)
read sequence base qualities, including soft clipped bases (None if not present)
aligned portion of the read and excludes any flanking bases that were soft clipped (None if not present)
SAM/BAM files may included extra flanking bases sequences that were not part of the alignment. These bases may be the result of the Smith-Waterman or other algorithms, which may not require alignments that begin at the first residue or end at the last. In addition, extra sequencing adapters, multiplex identifiers, and low-quality bases that were not considered for alignment may have been retained.
length of the read (read only). Returns 0 if not given.
target ID
DEPRECATED from pysam-0.4 - use tid in the future. The rname field caused a lot of confusion as it returns the target ID instead of the reference sequence name.
Note
This field contains the index of the reference sequence in the sequence dictionary. To obtain the name of the reference sequence, use pysam.Samfile.getrname()
read sequence bases, including soft clipped bases (None if not present)
the tags in the AUX field.
This property permits convenience access to the tags. Changes it the returned list will not update the tags automatically. Instead, the following is required for adding a new tag:
read.tags = read.tags + [("RG",0)]
target ID
Note
This field contains the index of the reference sequence in the sequence dictionary. To obtain the name of the reference sequence, use pysam.Samfile.getrname()
A pileup column. A pileup column contains all the reads that map to a certain target base.
A pileup column. A pileup column contains all the reads that map to a certain target base.
This class is a proxy for results returned by the samtools pileup engine. If the underlying engine iterator advances, the results of this column will change.
number of reads mapping to this column.
list of reads (pysam.PileupRead) aligned to this column
the chromosome ID as is defined in the header
A read aligned to a column.
a pysam.AlignedRead object of the aligned read
indel length; 0 for no indel, positive for ins and negative for del
1 iff the base on the padded read is a deletion
position of the read base at the pileup site, 0-based
(IteratorColumn iterator)
call SNPs within a region.
iterator is a pileup iterator. SNPs will be called on all positions returned by this iterator.
This caller is fast if SNPs are called over large continuous regions. It is slow, if instantiated frequently and in random order as the sequence will have to be reloaded.
x.next() -> the next value, or raise StopIteration
(IteratorColumn iterator_column )
The samtools SNP caller.
This object will call SNPs in samfile against the reference sequence in fasta.
This caller is fast for calling few SNPs in selected regions.
It is slow, if called over large genomic regions.
call a snp on chromosome reference and position pos.
returns a SNPCall object.
(IteratorColumn iterator_column )
The samtools SNP caller.
This object will call SNPs in samfile against the reference sequence in fasta.
This caller is fast for calling few SNPs in selected regions.
It is slow, if called over large genomic regions.
call a snp on chromosome reference and position pos.
returns a SNPCall object or None, if no indel call could be made.
(IteratorColumn iterator)
call indels within a region.
iterator is a pileup iterator. SNPs will be called on all positions returned by this iterator.
This caller is fast if SNPs are called over large continuous regions. It is slow, if instantiated frequently and in random order as the sequence will have to be reloaded.
x.next() -> the next value, or raise StopIteration
tabix_index(filename, force=False, seq_col=None, start_col=None, end_col=None, preset=None, meta_char=’#’, zerobased=False)
index tab-separated filename using tabix.
An existing index will not be overwritten unless force is set.
The index will be built from coordinates in columns seq_col, start_col and end_col.
The contents of filename have to be sorted by contig and position - the method does not check if the file is sorted.
Column indices are 0-based. Coordinates in the file are assumed to be 1-based.
If preset is provided, the column coordinates are taken from a preset. Valid values for preset are “gff”, “bed”, “sam”, “vcf”, psltbl”, “pileup”.
Lines beginning with meta_char and the first line_skip lines will be skipped.
If filename does not end in ”.gz”, it will be automatically compressed. The original file will be removed and only the compressed file will be retained.
If filename ends in gz, the file is assumed to be already compressed with bgzf.
returns the filename of the compressed data
tabix_compress(filename_in, filename_out, force=False)
compress filename_in writing the output to filename_out.
Raise an IOError if filename_out already exists, unless force is set.
(filename, mode=’r’)
opens a tabix file for reading. A missing index (filename + ”.tbi”) will raise an exception.
chromosome names
Tabixfile.fetch(self, reference=None, start=None, end=None, region=None, parser=None)
fetch one or more rows in a region using 0-based indexing. The region is specified by reference, start and end. Alternatively, a samtools region string can be supplied.
Without reference or region all entries will be fetched.
If only reference is set, all reads matching on reference will be fetched.
If parser is None, the results are returned as an unparsed string. Otherwise, parser is assumed to be a functor that will return parsed data (see for example asTuple() and asGTF()).
exception raised in case of an error incurred in the samtools library.
samtools dispatcher.
Emulates the samtools command line as module calls.
Captures stdout and stderr.
Raises a pysam.SamtoolsError exception in case samtools exits with an error code other than 0.
Some command line options are associated with parsers. For example, the samtools command “pileup -c” creates a tab-separated table on standard output. In order to associate parsers with options, an optional list of parsers can be supplied. The list will be processed in order checking for the presence of each option.
If no parser is given or no appropriate parser is found, the stdout output of samtools commands will be returned.
return the samtools usage information for this command