pysam - An interface for reading and writing SAM files

Pysam is a python module that makes it easy to read and manipulate mapped short read sequence data stored in SAM/BAM files. It is a lightweight wrapper of the samtools C-API.

To use the module to read a file in BAM format, open the file with Samfile:

import pysam
samfile = pysam.Samfile( "ex1.bam", "rb" )

Now the file is open you can iterate over all of the read mapping to a specified region using fetch(). Each iteration returns a AlignedRead object which represents a single read along with its fields and optional tags:

for alignedread in samfile.fetch('chr1', 100, 120):
     print alignedread

samfile.close()

To give:

EAS56_57:6:190:289:82       0       99      <<<7<<<;<<<<<<<<8;;<7;4<;<;;;;;94<;     69      CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA     0       192     1
EAS56_57:6:190:289:82       0       99      <<<<<<;<<<<<<<<<<;<<;<<<<;8<6;9;;2;     137     AGGGGTGCAGAGCCGAGTCACGGGGTTGCCAGCAC     73      64      1
EAS51_64:3:190:727:308      0       102     <<<<<<<<<<<<<<<<<<<<<<<<<<<::<<<844     99      GGTGCAGAGCCGAGTCACGGGGTTGCCAGCACAGG     99      18      1
...

You can also write to a Samfile:

import pysam
samfile = pysam.Samfile("ex1.bam", "rb")
pairedreads = pysam.Samfile("allpaired.bam", "wb", template=samfile)
for read in samfile.fetch():
     if read.is_paired:
             pairedreads.write(read)

pairedreads.close()
samfile.close()

An alternative way of accessing the data in a SAM file is by iterating over each base of a specified region using the pileup() method. Each iteration returns a PileupColumn which represents all the reads in the SAM file that map to a single base in the reference sequence. The list of reads are represented as PileupRead objects in the PileupColumn.pileups property:

import pysam
samfile = pysam.Samfile("ex1.bam", "rb" )
for pileupcolumn in samfile.pileup( 'chr1', 100, 120):
    print
    print 'coverage at base %s = %s' % (pileupcolumn.pos , pileupcolumn.n)
    for pileupread in pileupcolumn.pileups:
        print '\tbase in read %s = %s' % (pileupread.alignment.qname, pileupread.alignment.seq[pileupread.qpos])

samfile.close()

To give:

coverage at base 99 = 1
    base in read EAS56_57:6:190:289:82 = A

coverage at base 100 = 1
    base in read EAS56_57:6:190:289:82 = G

coverage at base 101 = 1
    base in read EAS56_57:6:190:289:82 = G

coverage at base 102 = 2
    base in read EAS56_57:6:190:289:82 = G
    base in read EAS51_64:3:190:727:308 = G
...

Commands available in csamtools are available as simple function calls. For example:

pysam.sort( "ex1.bam", "output" )

corresponds to the command line:

samtools sort ex1.bam output

More detailed usage instructions is at Usage.

Note

Coordinates in pysam are always 0-based (following the python convention). SAM text files use 1-based coordinates. The above examples can be run in the tests directory of the installation directory. Type ‘make’ before running them.

See also

http://code.google.com/p/pysam/
The pysam Google code page, containing source code and download instructions
http://wwwfgu.anat.ox.ac.uk/~andreas/documentation/samtools/contents.html
The pysam website containing documentation

API

class pysam.Samfile

(filename, mode=’r’, template = None, referencenames = None, referencelengths = None, text = NULL, header = None)

A SAM/BAM formatted file. The file is automatically opened.

mode should be r for reading or w for writing. The default is text mode so for binary (BAM) I/O you should append b for compressed or u for uncompressed BAM output. Use h to output header information in text (TAM) mode.

If b is present, it must immediately follow r or w. Valid modes are r, w, wh, rb, wb and wbu. For instance, to open a BAM formatted file for reading, type:

import pysam
f = pysam.Samfile('ex1.bam','rb')

If an index for a BAM file exists (.bai), it will be opened automatically. Without an index random access to reads via fetch() and pileup() is disabled.

For writing, the header of a SAM file/BAM file can be constituted from several sources (see also the samtools format specification):

  1. If template is given, the header is copied from a another Samfile (template must be of type Samfile).
  2. If header is given, the header is built from a multi-level dictionary. The first level are the four types (‘HD’, ‘SQ’, ...). The second level are a list of lines, with each line being a list of tag-value pairs.
  3. If text is given, new header text is copied from raw text.
  4. The names (referencenames) and lengths (referencelengths) are supplied directly as lists.
close

closes the pysam.Samfile.

count

(reference = None, start = None, end = None, region = None, callback = None, until_eof = False)

count reads region using 0-based indexing. The region is specified by reference, start and end. Alternatively, a samtools region string can be supplied.

Note that a TAM file does not allow random access. If region or reference are given, an exception is raised.

fetch

fetch aligned reads in a region using 0-based indexing. The region is specified by reference, start and end. Alternatively, a samtools region string can be supplied.

Without reference or region all reads will be fetched. The reads will be returned ordered by reference sequence, which will not necessarily be the order within the file. If until_eof is given, all reads from the current file position will be returned in order as they are within the file.

If only reference is set, all reads aligned to reference will be fetched.

The method returns an iterator of type pysam.IteratorRow unless a callback is provided. If *callback is given, the callback will be executed for each position within the region. Note that callbacks currently work only, if region or reference is given.

Note that a SAM file does not allow random access. If region or reference are given, an exception is raised.

filename

number of filename associated with this object.

getrname

convert numerical tid into reference name.

gettid

convert reference name into numerical tid

returns -1 if reference is not known.

header

header information within the sam file. The records and fields are returned as a two-level dictionary.

lengths

tuple of the lengths of the reference sequences. The lengths are in the same order as pysam.Samfile.references

mate

return the mate of AlignedRead read.

Throws a ValueError if read is unpaired or the mate is unmapped.

Note

Calling this method will change the file position. This might interfere with any iterators that have not re-opened the file.

next

x.next() -> the next value, or raise StopIteration

nreferences

number of reference sequences in the file.

pileup

perform a pileup within a region. The region is specified by reference, start and end (using 0-based indexing). Alternatively, a samtools region string can be supplied.

Without reference or region all reads will be used for the pileup. The reads will be returned ordered by reference sequence, which will not necessarily be the order within the file.

The method returns an iterator of type pysam.IteratorColumn unless a callback is provided. If a *callback is given, the callback will be executed for each column within the region.

Note that SAM formatted files do not allow random access. In these files, if a region or reference are given an exception is raised.

Optional kwargs to the iterator:

stepper

The stepper controlls how the iterator advances. Possible options for the stepper are

all
use all reads for pileup.
samtools
same filter and read processing as in csamtools pileup
fastafile
A FastaFile object
mask
Skip all reads with bits set in mask.

Note

all reads which overlap the region are returned. The first base returned will be the first base of the first read not necessarily the first base of the region used in the query.

references

tuple with the names of reference sequences.

seek

move file pointer to position offset, see pysam.Samfile.tell().

tell

return current file position

text

full contents of the sam file header as a string.

write

write a single pysam.AlignedRead to disk.

returns the number of bytes written.

class pysam.Fastafile

(filename)

A FASTA file. The file is automatically opened.

The file expects an indexed fasta file.

TODO:
add automatic indexing. add function to get sequence names.
close
fetch

(reference = None, start = None, end = None, region = None)

fetch AlignedRead() objects in a region using 0-based indexing.

The region is specified by reference, start and end.

fetch returns an empty string if the region is out of range or addresses an unknown reference.

If reference is given and start is None, the sequence from the first base is returned. Similarly, if end is None, the sequence until the last base is returned.

Alternatively, a samtools region string can be supplied.

filename

number of filename associated with this object.

class pysam.IteratorRow

abstract base class for iterators over mapped reads.

Various iterators implement different behaviours for wrapping around contig boundaries. Examples include:

pysam.IteratorRowRegion
iterate within a single contig and a defined region.
pysam.IteratorRowAll
iterate until EOF. This iterator will also include unmapped reads.
pysam.IteratorRowAllRefs
iterate over all reads in all reference sequences.

The method Samfile.fetch() returns an IteratorRow.

class pysam.IteratorColumn

abstract base class for iterators over columns.

IteratorColumn objects wrap the pileup functionality of samtools.

For reasons of efficiency, the iterator points to the current pileup buffer. The pileup buffer is updated at every iteration. This might cause some unexpected behavious. For example, consider the conversion to a list:

f = Samfile("file.bam", "rb")
result = list( f.pileup() )

Here, result will contain n objects of type PileupProxy for n columns, but each object in result will contain the same information.

The desired behaviour can be achieved by list comprehension:

result = [ x.pileups() for x in f.pileup() ]

result will be a list of n lists of objects of type PileupRead.

If the iterator is associated with a Fastafile using the addReference() method, then the iterator will export the current sequence via the methods getSequence() and seq_len().

Optional kwargs to the iterator

stepper

The stepper controlls how the iterator advances. Possible options for the stepper are

all
use all reads for pileup.
samtools
same filter and read processing as in csamtools pileup
fastafile
A FastaFile object
mask
Skip all reads with bits set in mask.
addReference

add reference sequences in fastafile to iterator.

hasReference

return true if iterator is associated with a reference

seq_len

current sequence length.

class pysam.AlignedRead

Class representing an aligned read. see SAM format specification for meaning of fields (http://samtools.sourceforge.net/).

This class stores a handle to the samtools C-structure representing an aligned read. Member read access is forwarded to the C-structure and converted into python objects. This implementation should be fast, as only the data needed is converted.

For write access, the C-structure is updated in-place. This is not the most efficient way to build BAM entries, as the variable length data is concatenated and thus needs to resized if a field is updated. Furthermore, the BAM entry might be in an inconsistent state. The validate() method can be used to check if an entry is consistent.

One issue to look out for is that the sequence should always be set before the quality scores. Setting the sequence will also erase any quality scores that were set previously.

aend

aligned end position of the read (read only). Returns None if not available.

alen

aligned length of the read (read only). Returns None if not available.

bin

properties bin

cigar

the cigar alignment (None if not present).

compare

return -1,0,1, if contents in this are binary <,=,> to other

fancy_str

returns list of fieldnames/values in pretty format for debugging

flag

properties flag

is_duplicate

true if optical or PCR duplicate

is_paired

true if read is paired in sequencing

is_proper_pair

true if read is mapped in a proper pair

is_qcfail

true if QC failure

is_read1

true if this is read1

is_read2

true if this is read2

is_reverse

true if read is mapped to reverse strand

is_secondary

true if not primary alignment

is_unmapped

true if read itself is unmapped

isize

the insert size

mapq

mapping quality

mate_is_reverse

true is read is mapped to reverse strand

mate_is_unmapped

true if the mate is unmapped

mpos

the position of the mate

mrnm

the reference id of the mate

opt

retrieves optional data given a two-letter tag

pos

0-based leftmost coordinate

qend

end index of the aligned query portion of the sequence (0-based, exclusive)

qlen

Length of the aligned query sequence

qname

the query name (None if not present)

qqual

aligned query sequence quality values (None if not present)

qstart

start index of the aligned query portion of the sequence (0-based, inclusive)

qual

read sequence base qualities, including soft clipped bases (None if not present)

query

aligned portion of the read and excludes any flanking bases that were soft clipped (None if not present)

SAM/BAM files may included extra flanking bases sequences that were not part of the alignment. These bases may be the result of the Smith-Waterman or other algorithms, which may not require alignments that begin at the first residue or end at the last. In addition, extra sequencing adapters, multiplex identifiers, and low-quality bases that were not considered for alignment may have been retained.

rlen

length of the read (read only). Returns 0 if not given.

rname

target ID

DEPRECATED from pysam-0.4 - use tid in the future. The rname field caused a lot of confusion as it returns the target ID instead of the reference sequence name.

Note

This field contains the index of the reference sequence in the sequence dictionary. To obtain the name of the reference sequence, use pysam.Samfile.getrname()

seq

read sequence bases, including soft clipped bases (None if not present)

tags

the tags in the AUX field.

This property permits convenience access to the tags. Changes it the returned list will not update the tags automatically. Instead, the following is required for adding a new tag:

read.tags = read.tags + [("RG",0)]
tid

target ID

Note

This field contains the index of the reference sequence in the sequence dictionary. To obtain the name of the reference sequence, use pysam.Samfile.getrname()

class pysam.PileupColumn

A pileup column. A pileup column contains all the reads that map to a certain target base.

tid
chromosome ID as is defined in the header
pos
the target base coordinate (0-based)
n
number of reads mapping to this column
pileups
list of reads (pysam.PileupRead) aligned to this column
class pysam.PileupProxy

A pileup column. A pileup column contains all the reads that map to a certain target base.

tid
chromosome ID as is defined in the header
pos
the target base coordinate (0-based)
n
number of reads mapping to this column
pileups
list of reads (pysam.PileupRead) aligned to this column

This class is a proxy for results returned by the samtools pileup engine. If the underlying engine iterator advances, the results of this column will change.

n

number of reads mapping to this column.

pileups

list of reads (pysam.PileupRead) aligned to this column

pos
tid

the chromosome ID as is defined in the header

class pysam.PileupRead

A read aligned to a column.

alignment

a pysam.AlignedRead object of the aligned read

indel

indel length; 0 for no indel, positive for ins and negative for del

is_del

1 iff the base on the padded read is a deletion

is_head
is_tail
level
qpos

position of the read base at the pileup site, 0-based

class pysam.IteratorSNPCalls

(IteratorColumn iterator)

call SNPs within a region.

iterator is a pileup iterator. SNPs will be called on all positions returned by this iterator.

This caller is fast if SNPs are called over large continuous regions. It is slow, if instantiated frequently and in random order as the sequence will have to be reloaded.

next

x.next() -> the next value, or raise StopIteration

class pysam.SNPCaller

(IteratorColumn iterator_column )

The samtools SNP caller.

This object will call SNPs in samfile against the reference sequence in fasta.

This caller is fast for calling few SNPs in selected regions.

It is slow, if called over large genomic regions.

call

call a snp on chromosome reference and position pos.

returns a SNPCall object.

class pysam.IndelCaller

(IteratorColumn iterator_column )

The samtools SNP caller.

This object will call SNPs in samfile against the reference sequence in fasta.

This caller is fast for calling few SNPs in selected regions.

It is slow, if called over large genomic regions.

call

call a snp on chromosome reference and position pos.

returns a SNPCall object or None, if no indel call could be made.

class pysam.IteratorIndelCalls

(IteratorColumn iterator)

call indels within a region.

iterator is a pileup iterator. SNPs will be called on all positions returned by this iterator.

This caller is fast if SNPs are called over large continuous regions. It is slow, if instantiated frequently and in random order as the sequence will have to be reloaded.

next

x.next() -> the next value, or raise StopIteration

pysam.tabix_index()

tabix_index(filename, force=False, seq_col=None, start_col=None, end_col=None, preset=None, meta_char=’#’, zerobased=False)

index tab-separated filename using tabix.

An existing index will not be overwritten unless force is set.

The index will be built from coordinates in columns seq_col, start_col and end_col.

The contents of filename have to be sorted by contig and position - the method does not check if the file is sorted.

Column indices are 0-based. Coordinates in the file are assumed to be 1-based.

If preset is provided, the column coordinates are taken from a preset. Valid values for preset are “gff”, “bed”, “sam”, “vcf”, psltbl”, “pileup”.

Lines beginning with meta_char and the first line_skip lines will be skipped.

If filename does not end in ”.gz”, it will be automatically compressed. The original file will be removed and only the compressed file will be retained.

If filename ends in gz, the file is assumed to be already compressed with bgzf.

returns the filename of the compressed data

pysam.tabix_compress()

tabix_compress(filename_in, filename_out, force=False)

compress filename_in writing the output to filename_out.

Raise an IOError if filename_out already exists, unless force is set.

class pysam.Tabixfile

(filename, mode=’r’)

opens a tabix file for reading. A missing index (filename + ”.tbi”) will raise an exception.

contigs

chromosome names

fetch

Tabixfile.fetch(self, reference=None, start=None, end=None, region=None, parser=None)

fetch one or more rows in a region using 0-based indexing. The region is specified by reference, start and end. Alternatively, a samtools region string can be supplied.

Without reference or region all entries will be fetched.

If only reference is set, all reads matching on reference will be fetched.

If parser is None, the results are returned as an unparsed string. Otherwise, parser is assumed to be a functor that will return parsed data (see for example asTuple() and asGTF()).

class pysam.asTuple

converts a tabix row into a python tuple.

class pysam.asGTF

converts a tabix row into a GTF record.

exception pysam.SamtoolsError(value)

exception raised in case of an error incurred in the samtools library.

class pysam.SamtoolsDispatcher(dispatch, parsers)

samtools dispatcher.

Emulates the samtools command line as module calls.

Captures stdout and stderr.

Raises a pysam.SamtoolsError exception in case samtools exits with an error code other than 0.

Some command line options are associated with parsers. For example, the samtools command “pileup -c” creates a tab-separated table on standard output. In order to associate parsers with options, an optional list of parsers can be supplied. The list will be processed in order checking for the presence of each option.

If no parser is given or no appropriate parser is found, the stdout output of samtools commands will be returned.

getMessages()
usage()

return the samtools usage information for this command

Table Of Contents

Previous topic

pysam: samtools interface for python

Next topic

Usage

This Page