bioseq
packageThe purpose of this vignette is to
illustrate the use of the bioseq
package with a practical
case. For a general introduction to the package please see the vignette
Introduction to the bioseq package. Here we show how the
functions of bioseq
can be used to clean and explore real
biological data. We take as example a set of rbcL sequences of
Fragilaria species retrieved from NCBI (these data ared distributed with
the package, see ?fragilaria
). Fragilaria is a genus of
freshwater and saltwater diatoms. Diatoms are a major group of
microscopic algae.
Data obtained from generic repository like NCBI or BOLD often need to
be filtered and cleaned before they can be used in a project. Here we
show how this can be accomplished using bioseq
and more
particularly how to perform some basic cleaning, extract a specific
barcode region from raw sequences, and use functions from other
packages. Some of the code chunks presented here can be adapted to
reconstruct a phylogeny for a particular taxonomic group or to prepare a
reference database against which metabarcoding sequencing reads will be
matched for taxonomic identification.
We start by loading the package bioseq
. It is assumed
that you already intalled the package either from CRAN using the
function install.packages("bioseq)
or from GitHub using
remotes::install_github("fkeck/bioseq")
.
We additionaly load the tidyverse
package. This is a
meta-package which loads several useful packages that we will need for
later. In particular we will use tibble
and
dplyr
to work with tibbles. This allows to work with other
types of data and to centralize everything in one place. To learn more
about the tidyverse and these packages you can visit.
Genetic data are often delivered in FASTA format. In this example we
will use rbcL sequences of Fragilaria species retrieved from NCBI and
directly available as unparsed FASTA in the package. We can use the
function read_fasta
to directly read a FASTA file into R.
Here we read a vector of character already available in R, but most of
the time the argument file
will be used to pass a path to a
file available locally or via the network.
data(fragilaria, package = "bioseq")
fra_data <- read_fasta(fragilaria)
fra_data
#> DNA vector of 41 sequences
#> MK576039.1 Fragil... TTCATCGCTTATGAATGTGATTTATTTGAAGAAGGTTCTTTAGCTAACTTAACAGCTTCT... + 575 bases
#> MK672908.1 Fragil... GCTTTCATCGCTTACGAATGTGATTTATTTGAAGAAGGTTCTTTAGCTAACTTAACAGCT... + 1107 bases
#> MK672907.1 Fragil... ATGTCTCAATCTGTATCAGAACGGACTCGAATCAAAAGTGACCGTTACGAATCTGGTGTA... + 1386 bases
#> MK672906.1 Fragil... ATGTCTCAATCTGTATCAGAACGGACTCGAATCAAAAGTGACCGTTACGAATCTGGTGTA... + 1386 bases
#> KT072939.1 Fragil... TTAGCATTATTCCGTATTACTCCACAACCAGGTGTAGATCCAGTAGAAGCAGCAGCAGCA... + 744 bases
#> KT072928.1 Fragil... TTAGCATTATTCCGTATTACTCCACAACCAGGTGTAGATCCAGTAGAAGCAGCAGCAGCA... + 744 bases
#> KT072903.1 Fragil... TTAGCATTATTCCGTATTACTCCACAACCAGGTGTAGATCCAGTAGAAGCAGCAGCAGCA... + 744 bases
#> MF092945.1 Fragil... AGTGACCGTTACGAATCTGGTGTAATCCCTTACGCTAAAATGGGTTACTGGGATGCTTCA... + 1263 bases
#> MF092943.1 Fragil... AGTGACCGTTACGAATCTGGTGTAATCCCTTACGCTAAAATGGGTTACTGGGATGCTTCA... + 1263 bases
#> JQ003574.1 Synedr... AAAAANTATGGGCGTGTAGTATACGAAGGTTTAAAAGGTGGATTAGACTTCTTAAAAGAT... + 390 bases
#> JQ003570.1 Fragil... ATTCCACACTCATACTTAAAAACATTCCAAGGTCCAGCAACTGGTATCATCGTAGAACGT... + 516 bases
#> JQ003569.1 Synedr... AAAAACTACGGTCGTGTAGTATATGAAGGTTTAAAAGGTGGTTTAGACTTCTTAAAAGAT... + 390 bases
#> ... with 29 more sequences.
According to the output, we have a DNA vector with 41 sequences. Those sequences were retrieved directly from NCBI. Therefore, they are not aligned and they have different lenght. It is easy to get the range of lengths with the following command:
As explained in the introduction, we want to analyze the data in a
dataframe as it present several advantages. The bioseq
package has been designed to work smoothly with tibbles
, an
extension of the standard dataframe. Below, we use the
as_tibble
function to turn the DNA vector into a
tibble
with two columns (one for the names and one for the
sequences). We will also clean the names and split them into two
columns. Finally we create an extra column with the length (number of
characters) of each sequence.
fra_data <- fra_data %>%
mutate(genbank_id = str_extract(label, "([^\\s]+)"),
taxa = str_extract(label, "(?<= ).*")) %>%
select(genbank_id, taxa, sequence)
fra_data <- fra_data %>%
mutate(n_base = seq_nchar(sequence))
fra_data
#> # A tibble: 41 × 4
#> genbank_id taxa sequence n_base
#> <chr> <chr> <DNA> <int>
#> 1 MK576039.1 Fragilaria MK576039.1 TTCATCGCTTATGAATGTGATTTATTTGAAGAAGG… 635
#> 2 MK672908.1 Fragilaria joachimii GCTTTCATCGCTTACGAATGTGATTTATTTGAAGA… 1167
#> 3 MK672907.1 Fragilaria heatherae ATGTCTCAATCTGTATCAGAACGGACTCGAATCAA… 1446
#> 4 MK672906.1 Fragilaria agnesiae ATGTCTCAATCTGTATCAGAACGGACTCGAATCAA… 1446
#> 5 KT072939.1 Fragilaria rumpens TTAGCATTATTCCGTATTACTCCACAACCAGGTGT… 804
#> 6 KT072928.1 Fragilaria capucina TTAGCATTATTCCGTATTACTCCACAACCAGGTGT… 804
#> 7 KT072903.1 Fragilaria crotonensis TTAGCATTATTCCGTATTACTCCACAACCAGGTGT… 804
#> 8 MF092945.1 Fragilaria MF092945.1 AGTGACCGTTACGAATCTGGTGTAATCCCTTACGC… 1323
#> 9 MF092943.1 Fragilaria MF092943.1 AGTGACCGTTACGAATCTGGTGTAATCCCTTACGC… 1323
#> 10 JQ003574.1 Synedra radians AAAAANTATGGGCGTGTAGTATACGAAGGTTTAAA… 450
#> # ℹ 31 more rows
In this example we are only interested in a particular region of the sequences, a short fragment used in many diatom metabarcoding studies. To extract the region of interest we can provide two nucleotide position but this solution require the sequences to be aligned. The alternative solution is to crop using nucleotide patterns. Here, we will use the forward and reverse primers delimiting the barcode region to crop the sequences. First, we create two DNA vectors one for the forward and one for the reverse primers:
FWD <- dna("AGGTGAAGTAAAAGGTTCWTACTTAAA",
"AGGTGAAGTTAAAGGTTCWTAYTTAAA",
"AGGTGAAACTAAAGGTTCWTACTTAAA")
REV <- dna("CAGTWGTWGGTAAATTAGAAGG",
"CTGTTGTWGGTAAATTAGAAGG")
You may have noticed that some characters of the primers are
ambiguous which means they can represent several nucleotides. To get all
the combinations represented by those ambiguous primer sequences we can
use the seq_disambiguate_IUPAC()
function. For example for
the forward primers:
seq_disambiguate_IUPAC(FWD)
#> [[1]]
#> DNA vector of 2 sequences
#> > AGGTGAAGTAAAAGGTTCATACTTAAA
#> > AGGTGAAGTAAAAGGTTCTTACTTAAA
#>
#> [[2]]
#> DNA vector of 4 sequences
#> > AGGTGAAGTTAAAGGTTCATACTTAAA
#> > AGGTGAAGTTAAAGGTTCTTACTTAAA
#> > AGGTGAAGTTAAAGGTTCATATTTAAA
#> > AGGTGAAGTTAAAGGTTCTTATTTAAA
#>
#> [[3]]
#> DNA vector of 2 sequences
#> > AGGTGAAACTAAAGGTTCATACTTAAA
#> > AGGTGAAACTAAAGGTTCTTACTTAAA
We can see that the three ambiguous sequences used as forward primers
actually represent eight different sequences. Fortunately, the functions
of bioseq
can automatically recognize ambiguous nucleotides
and interpret them as such. This is why it is always recommended to
construct DNA/RNA/AA vectors instead of simple character vector when it
is possible.
Now we crop the sequences using the forward and reverse primers as delimiters. We can expect that the returned sequences will be aligned because they correspond to a very specific region with a fixed length.
fra_data <- fra_data %>%
mutate(barcode = seq_crop_pattern(sequence,
pattern_in = list(FWD),
pattern_out = list(REV)))
fra_data
#> # A tibble: 41 × 5
#> genbank_id taxa sequence n_base barcode
#> <chr> <chr> <DNA> <int> <DNA>
#> 1 MK576039.1 Fragilaria MK576039.1 TTCATCGCTTATGAATGTGATTTATTT… 635 NA
#> 2 MK672908.1 Fragilaria joachimii GCTTTCATCGCTTACGAATGTGATTTA… 1167 AGGTGA…
#> 3 MK672907.1 Fragilaria heatherae ATGTCTCAATCTGTATCAGAACGGACT… 1446 AGGTGA…
#> 4 MK672906.1 Fragilaria agnesiae ATGTCTCAATCTGTATCAGAACGGACT… 1446 AGGTGA…
#> 5 KT072939.1 Fragilaria rumpens TTAGCATTATTCCGTATTACTCCACAA… 804 NA
#> 6 KT072928.1 Fragilaria capucina TTAGCATTATTCCGTATTACTCCACAA… 804 NA
#> 7 KT072903.1 Fragilaria crotonensis TTAGCATTATTCCGTATTACTCCACAA… 804 NA
#> 8 MF092945.1 Fragilaria MF092945.1 AGTGACCGTTACGAATCTGGTGTAATC… 1323 AGGTGA…
#> 9 MF092943.1 Fragilaria MF092943.1 AGTGACCGTTACGAATCTGGTGTAATC… 1323 AGGTGA…
#> 10 JQ003574.1 Synedra radians AAAAANTATGGGCGTGTAGTATACGAA… 450 NA
#> # ℹ 31 more rows
As expected, the barcode sequences seem to be aligned. We can also see that this operation has produced several NAs which correspond to sequences where the primers were not detected, probably because the barcode region was not included in the sequence. We could simply exclude the NA values, but it is safer to filter only the sequences whose length match exactly the expected length of the barcode (i.e. 312 bp).
fra_data <- fra_data %>%
filter(seq_nchar(barcode) == 312)
fra_data
#> # A tibble: 23 × 5
#> genbank_id taxa sequence n_base barcode
#> <chr> <chr> <DNA> <int> <DNA>
#> 1 MK672908.1 Fragilaria joachimii GCTTTCATCGCTTACGAATGTGATTTAT… 1167 AGGTGA…
#> 2 MK672907.1 Fragilaria heatherae ATGTCTCAATCTGTATCAGAACGGACTC… 1446 AGGTGA…
#> 3 MK672906.1 Fragilaria agnesiae ATGTCTCAATCTGTATCAGAACGGACTC… 1446 AGGTGA…
#> 4 MF092945.1 Fragilaria MF092945.1 AGTGACCGTTACGAATCTGGTGTAATCC… 1323 AGGTGA…
#> 5 MF092943.1 Fragilaria MF092943.1 AGTGACCGTTACGAATCTGGTGTAATCC… 1323 AGGTGA…
#> 6 EU090046.1 Fragilaria EU090046.1 TGACCGTTACGAATCTGGTGTAATCCCT… 1330 AGGTGA…
#> 7 EU090045.1 Fragilaria EU090045.1 TGACCGTTACGAATCTGGTGTAATCCCT… 1330 AGGTGA…
#> 8 EU090044.1 Fragilaria EU090044.1 TGACCGTTACGAATCTGGTGTAATCCCT… 1330 AGGTGA…
#> 9 EU090043.1 Fragilaria EU090043.1 TGACCGTTACGAATCTGGTGTAATCCCT… 1328 AGGTGA…
#> 10 EU090042.1 Fragilaria EU090042.1 ACCGTTACGAATCTGGTGTAATCCCTTA… 1329 AGGTGA…
#> # ℹ 13 more rows
We now want to reconstruct a phylogenetic tree for the sequences. We
can include all the sequences in the tree but for clarity we want keep
only one sequence per taxa. Instead of selecting one sequence randomly
we compute a consensus sequence for each taxa using the function
seq_consensus()
. There are different methods to compute a
consensus sequence implemented in the function, you can check the manual
for details. The default method used below uses the majority rule.
fra_consensus <- fra_data %>%
group_by(taxa) %>%
summarise(consensus_barcode = seq_consensus(barcode))
fra_consensus
#> # A tibble: 20 × 2
#> taxa consensus_barcode
#> <chr> <DNA>
#> 1 Fragilaria EU090039.1 AGGTGAAACTAAAGGTTCTTACTTAAACATCACAGCTGCTACTATGGAAGAAG…
#> 2 Fragilaria EU090041.1 AGGTGAAACTAAAGGTTCTTACTTAAACATCACAGCTGCTACTATGGAAGAAG…
#> 3 Fragilaria EU090042.1 AGGTGAAACTAAAGGTTCTTACTTAAACATCACAGCTGCTACTATGGAAGAAG…
#> 4 Fragilaria EU090043.1 AGGTGAAACTAAAGGTTCTTACTTAAACATCACAGCTGCTACTATGGAAGAAG…
#> 5 Fragilaria EU090044.1 AGGTGAAACTAAAGGTTCTTACTTAAACATCACAGCTGCTACTATGGAAGAAG…
#> 6 Fragilaria EU090045.1 AGGTGAAACTAAAGGTTCTTACTTAAACATCACAGCTGCTACTATGGAAGAAG…
#> 7 Fragilaria EU090046.1 AGGTGAAACTAAAGGTTCTTACTTAAACATCACAGCTGCTACTATGGAAGAAG…
#> 8 Fragilaria KC969741.1 AGGTGAAACTAAAGGTTCTTACTTAAACATCACAGCTGCTACTATGGAAGAAG…
#> 9 Fragilaria KC969742.1 AGGTGAAACTAAAGGTTCTTACTTAAACGTTACAGCTGCTACTATGGAAGATG…
#> 10 Fragilaria MF092943.1 AGGTGAAGTTAAAGGTTCTTACTTAAACATTACTGCTGCTACTATGGATGAAC…
#> 11 Fragilaria MF092945.1 AGGTGAAGTTAAAGGTTCTTACTTAAACATCACAGCTGCTACTATGGAAGAAC…
#> 12 Fragilaria agnesiae AGGTGAAGTTAAAGGTTCTTACTTAAACATCACTGCTGCTACTATGGAAGAAC…
#> 13 Fragilaria bidens AGGTGAAGTTAAAGGTTCTTACTTAAACATCACTGCTGCTACTATGGAAGAAC…
#> 14 Fragilaria crotonensis AGGTGAAGTTAAAGGTTCTTACTTAAACATCACTGCTGCTACTATGGAAGAAC…
#> 15 Fragilaria famelica AGGTGAAGTTAAAGGTTCTTACTTAAACATCACTGCTGCTACTATGGAAGAAG…
#> 16 Fragilaria heatherae AGGTGAAGTTAAAGGTTCTTACTTAAACATCACTGCTGCTACTATGGAAGAGC…
#> 17 Fragilaria joachimii AGGTGAAGTTAAAGGTTCTTACTTAAACATCACTGCTGCTACTATGGAAGAAC…
#> 18 Fragilaria perminuta AGGTGAAGTTAAAGGTTCTTACTTAAACATTACTGCTGCTACTATGGAAGAAC…
#> 19 Fragilaria rumpens AGGTGAAGTTAAAGGTTCTTACTTAAACATCACGGCTGCTACTATGGAAGAAC…
#> 20 Fragilaria striatula AGGTGAAACTAAAGGTTCTTACTTAAACATCACAGCTGCTACTATGGAAGAAG…
Next, we will use ape
to do a quick phylogenetic
reconstruction of these sequences. We can use the function
as_DNAbin
to convert a tibble into a DNAbin
object and then use the functions of ape
to reconstruct and
plot a phylogenetic tree with the distance-based BIONJ algorithm.
The tree above suggests that several sequences are exactly the same. This can be quickly assessed:
duplicated(fra_consensus$consensus_barcode)
#> [1] FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
#> [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
Indeed, several sequences are duplicates and we may want to keep only
unique sequences. We could select only unique sequence but here we
rather use the seq_cluster
function to cluster sequences
based on their similarity.
fra_consensus <-
fra_consensus %>%
mutate(cluster = seq_cluster(consensus_barcode,
threshold = 0.001))
fra_consensus
#> # A tibble: 20 × 3
#> taxa consensus_barcode cluster
#> <chr> <DNA> <int>
#> 1 Fragilaria EU090039.1 AGGTGAAACTAAAGGTTCTTACTTAAACATCACAGCTGCTACTAT… 1
#> 2 Fragilaria EU090041.1 AGGTGAAACTAAAGGTTCTTACTTAAACATCACAGCTGCTACTAT… 2
#> 3 Fragilaria EU090042.1 AGGTGAAACTAAAGGTTCTTACTTAAACATCACAGCTGCTACTAT… 2
#> 4 Fragilaria EU090043.1 AGGTGAAACTAAAGGTTCTTACTTAAACATCACAGCTGCTACTAT… 2
#> 5 Fragilaria EU090044.1 AGGTGAAACTAAAGGTTCTTACTTAAACATCACAGCTGCTACTAT… 2
#> 6 Fragilaria EU090045.1 AGGTGAAACTAAAGGTTCTTACTTAAACATCACAGCTGCTACTAT… 2
#> 7 Fragilaria EU090046.1 AGGTGAAACTAAAGGTTCTTACTTAAACATCACAGCTGCTACTAT… 2
#> 8 Fragilaria KC969741.1 AGGTGAAACTAAAGGTTCTTACTTAAACATCACAGCTGCTACTAT… 1
#> 9 Fragilaria KC969742.1 AGGTGAAACTAAAGGTTCTTACTTAAACGTTACAGCTGCTACTAT… 3
#> 10 Fragilaria MF092943.1 AGGTGAAGTTAAAGGTTCTTACTTAAACATTACTGCTGCTACTAT… 4
#> 11 Fragilaria MF092945.1 AGGTGAAGTTAAAGGTTCTTACTTAAACATCACAGCTGCTACTAT… 5
#> 12 Fragilaria agnesiae AGGTGAAGTTAAAGGTTCTTACTTAAACATCACTGCTGCTACTAT… 6
#> 13 Fragilaria bidens AGGTGAAGTTAAAGGTTCTTACTTAAACATCACTGCTGCTACTAT… 7
#> 14 Fragilaria crotonensis AGGTGAAGTTAAAGGTTCTTACTTAAACATCACTGCTGCTACTAT… 8
#> 15 Fragilaria famelica AGGTGAAGTTAAAGGTTCTTACTTAAACATCACTGCTGCTACTAT… 9
#> 16 Fragilaria heatherae AGGTGAAGTTAAAGGTTCTTACTTAAACATCACTGCTGCTACTAT… 10
#> 17 Fragilaria joachimii AGGTGAAGTTAAAGGTTCTTACTTAAACATCACTGCTGCTACTAT… 11
#> 18 Fragilaria perminuta AGGTGAAGTTAAAGGTTCTTACTTAAACATTACTGCTGCTACTAT… 12
#> 19 Fragilaria rumpens AGGTGAAGTTAAAGGTTCTTACTTAAACATCACGGCTGCTACTAT… 13
#> 20 Fragilaria striatula AGGTGAAACTAAAGGTTCTTACTTAAACATCACAGCTGCTACTAT… 1
Finally we recompute a consensus sequence for each cluster and we reconstruct the phylogenetic tree.
fra_consensus <-
fra_consensus %>%
group_by(cluster) %>%
summarise(taxa_group = paste(taxa, collapse = "/"),
consensus_barcode = seq_consensus(consensus_barcode))
fra_consensus %>%
as_DNAbin(consensus_barcode, taxa_group) %>%
ape::dist.dna() %>%
ape::bionj() %>%
plot()
It is important to note that this tree is not an accurate phylogenetic reconstruction but it can help to visualise the hierarchical relationships between the sequences. This is particularly useful with larger datasets to detect polyphyletic clades unresolved by the barcode region.
When we are happy with the results, we usually want to export the
data. Here, we could export directly the dataframe using
write_csv()
. However, it might be more convenient to export
sequences in FASTA to open them in an other software later. This can be
achieved using the function write_fasta()
.