Title: | A Toolbox for Manipulating Biological Sequences |
---|---|
Description: | Classes and functions to work with biological sequences (DNA, RNA and amino acid sequences). Implements S3 infrastructure to work with biological sequences as described in Keck (2020) <doi:10.1111/2041-210X.13490>. Provides a collection of functions to perform biological conversion among classes (transcription, translation) and basic operations on sequences (detection, selection and replacement based on positions or patterns). The package also provides functions to import and export sequences from and to other package formats. |
Authors: | Francois Keck [aut, cre, cph] |
Maintainer: | Francois Keck <[email protected]> |
License: | GPL-3 |
Version: | 0.1.4 |
Built: | 2024-10-29 03:46:21 UTC |
Source: | https://github.com/fkeck/bioseq |
aa()
build a AA vector from a character vector.
aa(...)
aa(...)
... |
character to turn into AA. Can be a set of name-value pairs. |
vector of class bioseq_aa
aa("AGGTGC", "TTCGA") aa(Seq_1 = "AGGTGC", Seq_2 = "TTCGA") x <- c("AGGTGC", "TTCGA") aa(x)
aa("AGGTGC", "TTCGA") aa(Seq_1 = "AGGTGC", Seq_2 = "TTCGA") x <- c("AGGTGC", "TTCGA") aa(x)
This function uses AliView (Larsson, 2014) to visualize DNA sequences. The software must be installed on the computer.
aliview( x, aliview_exec = getOption("bioseq.aliview.exec", default = "aliview") )
aliview( x, aliview_exec = getOption("bioseq.aliview.exec", default = "aliview") )
x |
a DNA, RNA or AA vector.
Alternatively a |
aliview_exec |
a character string giving the path of the program. |
By default, the function assumes that the executable is installed
in a directory located on the PATH. Alternatively the user can provide
an absolute path to the executable (i.e. the location where the software
was installed/uncompressed). This information can be stored in the global
options settings using
options(bioseq.aliview.exec = "my_path_to_aliview")
.
Larsson, A. (2014). AliView: a fast and lightweight alignment viewer and editor for large data sets. Bioinformatics 30(22): 3276-3278.
Other GUI wrappers:
seaview()
List of the allowed characters for each type of sequences.
A C G T W S M K R Y B D H V N -
A C G U W S M K R Y B D H V N -
A C D E F G H I K L M N P Q R S T V W Y B X Z J U O * -
Nomenclature Committee of the International Union of Biochemistry (NC-IUB) (1986). Proc. Natl. Acad. Sci. USA. 83 (1): 4–8.
Nomenclature and Symbolism for Amino Acids and Peptides. IUPAC-IUB Joint Commission on Biochemical Nomenclature. 1983.
Coercion to an amino acid (AA) vector
as_aa(x)
as_aa(x)
x |
An object to coerce. |
An amino acid vector of class bioseq_aa
Other conversions:
as-tibble-ape
,
as-tibble-bioseq
,
as_AAbin()
,
as_DNAbin()
,
as_dna()
,
as_rna()
,
as_seqinr_alignment()
Coerce to AAbin
as_AAbin(x, ...)
as_AAbin(x, ...)
x |
An object. |
... |
Other parameters. |
An AAbin object.
Other conversions:
as-tibble-ape
,
as-tibble-bioseq
,
as_DNAbin()
,
as_aa()
,
as_dna()
,
as_rna()
,
as_seqinr_alignment()
Coerce tibble to AAbin
## S3 method for class 'tbl_df' as_AAbin(x, sequences, labels = NULL, ...)
## S3 method for class 'tbl_df' as_AAbin(x, sequences, labels = NULL, ...)
x |
a tibble. |
sequences |
Name of the tibble column that stores the sequences. |
labels |
Name of the tibble column that stores the sequence labels. |
... |
Other params. |
An AAbin object.
Coercion to DNA vector
as_dna(x)
as_dna(x)
x |
An object to coerce. |
A DNA vector of class bioseq_dna
Other conversions:
as-tibble-ape
,
as-tibble-bioseq
,
as_AAbin()
,
as_DNAbin()
,
as_aa()
,
as_rna()
,
as_seqinr_alignment()
Coerce to DNAbin
as_DNAbin(x, ...)
as_DNAbin(x, ...)
x |
An object. |
... |
Other parameters. |
A DNAbin object.
Other conversions:
as-tibble-ape
,
as-tibble-bioseq
,
as_AAbin()
,
as_aa()
,
as_dna()
,
as_rna()
,
as_seqinr_alignment()
Coerce tibble to DNAbin
## S3 method for class 'tbl_df' as_DNAbin(x, sequences, labels = NULL, ...)
## S3 method for class 'tbl_df' as_DNAbin(x, sequences, labels = NULL, ...)
x |
a tibble. |
sequences |
Name of the tibble column that stores the sequences. |
labels |
Name of the tibble column that stores the sequence labels. |
... |
Other params. |
A DNAbin object.
Coercion to RNA vector
as_rna(x)
as_rna(x)
x |
An object to coerce. |
A RNA vector of class bioseq_rna
Other conversions:
as-tibble-ape
,
as-tibble-bioseq
,
as_AAbin()
,
as_DNAbin()
,
as_aa()
,
as_dna()
,
as_seqinr_alignment()
Coerce to seqinr alignment
as_seqinr_alignment(x, ...)
as_seqinr_alignment(x, ...)
x |
An object. |
... |
Other parameters. |
An alignment object.
Other conversions:
as-tibble-ape
,
as-tibble-bioseq
,
as_AAbin()
,
as_DNAbin()
,
as_aa()
,
as_dna()
,
as_rna()
These methods convert sequences from ape formats DNAbin and AAbin to tibbles.
as_tibble.DNAbin(x, label = "label", sequence = "sequence", ...) as_tibble.AAbin(x, label = "label", sequence = "sequence", ...)
as_tibble.DNAbin(x, label = "label", sequence = "sequence", ...) as_tibble.AAbin(x, label = "label", sequence = "sequence", ...)
x |
a DNAbin or AAbin object. |
label |
Name of the column that stores the sequence labels in the returned tibble. |
sequence |
Name of the column that stores the sequences in the returned tibble. |
... |
Not used. |
A tibble with two columns (if name is not NULL, the default) or one column (otherwise).
Other conversions:
as-tibble-bioseq
,
as_AAbin()
,
as_DNAbin()
,
as_aa()
,
as_dna()
,
as_rna()
,
as_seqinr_alignment()
require(ape) require(tibble) x <- rDNAbin(nrow = 10, ncol = 25) as_tibble.DNAbin(x)
require(ape) require(tibble) x <- rDNAbin(nrow = 10, ncol = 25) as_tibble.DNAbin(x)
Convert bioseq DNA, RNA and AA to tibble
as_tibble.bioseq_dna(x, label = "label", sequence = "sequence", ...) as_tibble.bioseq_rna(x, label = "label", sequence = "sequence", ...) as_tibble.bioseq_aa(x, label = "label", sequence = "sequence", ...)
as_tibble.bioseq_dna(x, label = "label", sequence = "sequence", ...) as_tibble.bioseq_rna(x, label = "label", sequence = "sequence", ...) as_tibble.bioseq_aa(x, label = "label", sequence = "sequence", ...)
x |
a DNA, RNA or AA vector. |
label |
Name of the column that stores the sequence labels in the returned tibble. |
sequence |
Name of the column that stores the sequences in the returned tibble. |
... |
Not used. |
A tibble with two columns (if name is not NULL, the default) or one column (otherwise).
Other conversions:
as-tibble-ape
,
as_AAbin()
,
as_DNAbin()
,
as_aa()
,
as_dna()
,
as_rna()
,
as_seqinr_alignment()
require(tibble) x <- dna(A = "ACGTTAGTGTAGCCGT", B = "CTCGAAATGA", C = NA) as_tibble(x)
require(tibble) x <- dna(A = "ACGTTAGTGTAGCCGT", B = "CTCGAAATGA", C = NA) as_tibble(x)
The function returns a list of named vectors with Start, Stop and Full_name attributes.
dic_genetic_codes()
dic_genetic_codes()
A list of genetic code tables for DNA/RNA translation.
dna()
build a DNA vector from a character vector.
dna(...)
dna(...)
... |
characters to turn into DNA. Can be a set of name-value pairs. |
a vector of class bioseq_dna
dna("AGGTGC", "TTCGA") dna(Seq_1 = "AGGTGC", Seq_2 = "TTCGA") x <- c("AGGTGC", "TTCGA") dna(x)
dna("AGGTGC", "TTCGA") dna(Seq_1 = "AGGTGC", Seq_2 = "TTCGA") x <- c("AGGTGC", "TTCGA") dna(x)
An unparsed FASTA of DNA sequences (rbcL) for various strains of Fragilaria retrieved from NCBI.
fragilaria
fragilaria
A long character vector (unparsed FASTA).
GenBank https://www.ncbi.nlm.nih.gov/genbank/ using the following search term: "(rbcl) AND Fragilaria"
read_fasta
to parse these data.
List of all genetic code tables available in bioseq
.
The number in bold can be used to select a table in appropriate functions.
1.
Standard
2.
Vertebrate Mitochondrial
3.
Yeast Mitochondrial
4.
Mold Mitochondrial; Protozoan Mitochondrial;
Coelenterate Mitochondrial; Mycoplasma; Spiroplasma
5.
Invertebrate Mitochondrial
6.
Ciliate Nuclear; Dasycladacean Nuclear;
Hexamita Nuclear
9.
Echinoderm Mitochondrial; Flatworm Mitochondrial
10.
Euplotid Nuclear
11.
Bacterial, Archaeal and Plant Plastid
12.
Alternative Yeast Nuclear
13.
Ascidian Mitochondrial
14.
Alternative Flatworm Mitochondrial
15.
Blepharisma Macronuclear
16.
Chlorophycean Mitochondrial
21.
Trematode Mitochondrial
22.
Scenedesmus obliquus Mitochondrial
23.
Thraustochytrium Mitochondrial
24.
Pterobranchia Mitochondrial
25.
Candidate Division SR1 and Gracilibacteria
26.
Pachysolen tannophilus Nuclear
27.
Karyorelict Nuclear
28.
Condylostoma Nuclear
29.
Mesodinium Nuclear
30.
Peritrich Nuclear
31.
Blastocrithidia Nuclear
32.
Balanophoraceae Plastid
33.
Cephalodiscidae Mitochondrial
Andrzej (Anjay) Elzanowski and Jim Ostell at National Center for Biotechnology Information (NCBI), Bethesda, Maryland, U.S.A. https://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=tgencodes
This function returns TRUE for objects of class bioseq_aa
is_aa(x)
is_aa(x)
x |
An object. |
Logical.
x <- c("AGGTGC", "TTCGA") is_aa(x) y <- aa(x) is_aa(x)
x <- c("AGGTGC", "TTCGA") is_aa(x) y <- aa(x) is_aa(x)
This function returns TRUE for objects of class bioseq_dna
is_dna(x)
is_dna(x)
x |
An object. |
Logical.
x <- c("AGGTGC", "TTCGA") is_dna(x) y <- dna(x) is_dna(y)
x <- c("AGGTGC", "TTCGA") is_dna(x) y <- dna(x) is_dna(y)
This function returns TRUE for objects of class bioseq_rna
is_rna(x)
is_rna(x)
x |
An object. |
Logical.
x <- c("AGGTGC", "TTCGA") is_rna(x) y <- rna(x) is_rna(x)
x <- c("AGGTGC", "TTCGA") is_rna(x) y <- rna(x) is_rna(x)
Amino acid (AA) vector constructor
new_aa(x = character())
new_aa(x = character())
x |
a character vector. |
DNA vector constructor
new_dna(x = character())
new_dna(x = character())
x |
a character vector. |
RNA vector constructor
new_rna(x = character())
new_rna(x = character())
x |
a character vector. |
Read sequences in FASTA format
read_fasta(file, type = "DNA")
read_fasta(file, type = "DNA")
file |
A path to a file, a connection or a character string. |
type |
Type of data. Can be "DNA" (the default), "RNA" or "AA". |
A DNA, RNA or AA vector (depending on type
argument).
Other input/output operations:
write_fasta()
Reverse and complement sequences
seq_complement(x) seq_reverse(x)
seq_complement(x) seq_reverse(x)
x |
a DNA or RNA vector.
Function |
A reverse or complement sequence (same class as the input).
Other biological operations:
seq_rev_translate()
,
seq_translate()
,
transcription
x <- dna("ACTTTGGCTAAG") seq_reverse(x) seq_complement(x)
x <- dna("ACTTTGGCTAAG") seq_reverse(x) seq_complement(x)
rna()
build a RNA vector from a character vector.
rna(...)
rna(...)
... |
characters to turn into RNA. Can be a set of name-value pairs. |
a vector of class bioseq_rna
rna("AGGUGC", "UUCGA") rna(Seq_1 = "AGGUGC", Seq_2 = "UUCGA") x <- c("AGGTGC", "TTCGA") rna(x)
rna("AGGUGC", "UUCGA") rna(Seq_1 = "AGGUGC", Seq_2 = "UUCGA") x <- c("AGGTGC", "TTCGA") rna(x)
This function opens SeaView (Gouy, Guindon & Gascuel, 2010) to visualize biological sequences and phylogenetic trees. The software must be installed on the computer.
seaview( x, seaview_exec = getOption("bioseq.seaview.exec", default = "seawiew") )
seaview( x, seaview_exec = getOption("bioseq.seaview.exec", default = "seawiew") )
x |
a DNA, RNA or AA vector.
Alternatively a |
seaview_exec |
a character string giving the path of the program. |
By default, the function assumes that the executable is installed
in a directory located on the PATH. Alternatively the user can provide
an absolute path to the executable (i.e. the location where the software
was installed/uncompressed). This can be stored in the global
options settings using
options(bioseq.seaview.exec = "my_path_to_seaview")
.
Gouy M., Guindon S. & Gascuel O. (2010) SeaView version 4 : a multiplatform graphical user interface for sequence alignment and phylogenetic tree building. Molecular Biology and Evolution 27(2):221-224.
Other GUI wrappers:
aliview()
Cluster sequences by similarity
seq_cluster(x, threshold = 0.05, method = "complete")
seq_cluster(x, threshold = 0.05, method = "complete")
x |
a DNA, RNA or AA vector of sequences to clustered. |
threshold |
Threshold value (range in [0, 1]). |
method |
the clustering method (see details). |
The function uses ape dist.dna
and
dist.aa
functions to compute pairwise distances among sequences and
hclust
for clustering.
Computing a full pairwise diastance matrix can be computationally expensive. It is recommended to use this function for moderate size dataset.
Supported methods are:
"single"
(= Nearest Neighbour Clustering)
"complete"
(= Farthest Neighbour Clustering)
"average"
(= UPGMA)
"mcquitty"
(= WPGMA)
An integer vector with group memberships.
Function seq_consensus
to compute consensus
and representative sequences for clusters.
Other aggregation operations:
seq_consensus()
x <- c("-----TACGCAGTAAAAGCTACTGATG", "CGTCATACGCAGTAAAAACTACTGATG", "CTTCATACGCAGTAAAAACTACTGATG", "CTTCATATGCAGTAAAAACTACTGATG", "CTTCATACGCAGTAAAAACTACTGATG", "CGTCATACGCAGTAAAAGCTACTGATG", "CTTCATATGCAGTAAAAGCTACTGACG") x <- dna(x) seq_cluster(x)
x <- c("-----TACGCAGTAAAAGCTACTGATG", "CGTCATACGCAGTAAAAACTACTGATG", "CTTCATACGCAGTAAAAACTACTGATG", "CTTCATATGCAGTAAAAACTACTGATG", "CTTCATACGCAGTAAAAACTACTGATG", "CGTCATACGCAGTAAAAGCTACTGATG", "CTTCATATGCAGTAAAAGCTACTGACG") x <- dna(x) seq_cluster(x)
Combine multiple sequences
seq_combine(..., sep = "", collapse = NULL)
seq_combine(..., sep = "", collapse = NULL)
... |
One or more vectors of sequences (DNA, RNA, AA). They must all be of the same type. Short vectors are recycled. |
sep |
String to insert between input vectors. |
collapse |
If not |
The strings sep
and collapse
w ill be coerced to
the type of input vectors with a warning if some character have to replaced.
A vector of sequences (if collapse is NULL
).
A vector with a single sequence, otherwise.
stri_join
from stringi and
str_c
from stringr
for the underlying implementation.
Other string operations:
seq-replace
,
seq_count_pattern()
,
seq_crop_pattern()
,
seq_crop_position()
,
seq_detect_pattern()
,
seq_extract_pattern()
,
seq_extract_position()
,
seq_remove_pattern()
,
seq_remove_position()
,
seq_replace_position()
,
seq_split_kmer()
,
seq_split_pattern()
x <- dna("ACGTTAGTGTAGCCGT", "CTCGAAATGA") y <- dna("TTTTTTTT", "AAAAAAAAA") seq_combine(x, y) seq_combine(y, x, sep = "CCCCC") seq_combine(y, x, sep = "CCCCC", collapse = "GGGGG")
x <- dna("ACGTTAGTGTAGCCGT", "CTCGAAATGA") y <- dna("TTTTTTTT", "AAAAAAAAA") seq_combine(x, y) seq_combine(y, x, sep = "CCCCC") seq_combine(y, x, sep = "CCCCC", collapse = "GGGGG")
Find a consensus sequence for a set of sequences.
seq_consensus(x, method = "chr_majority", weights = NULL, gaps = TRUE)
seq_consensus(x, method = "chr_majority", weights = NULL, gaps = TRUE)
x |
a DNA, RNA or AA vector. |
method |
the consensus method (see Details). |
weights |
an optional numeric vector of same length
as |
gaps |
logical. Should the gaps ("-") taken into account. |
"chr_majority", "chr_ambiguity", "seq_centrality", "seq_majority"
For chr_ambiguity gap character always override other characters. Use gaps = FALSE to ignore gaps.
A consensus sequence
Other aggregation operations:
seq_cluster()
x <- c("-----TACGCAGTAAAAGCTACTGATG", "CGTCATACGCAGTAAAAACTACTGATG", "CTTCATACGCAGTAAAAACTACTGATG", "CTTCATATGCAGTAAAAACTACTGATG", "CTTCATACGCAGTAAAAACTACTGATG", "CGTCATACGCAGTAAAAGCTACTGATG", "CTTCATATGCAGTAAAAGCTACTGACG") x <- dna(x) seq_consensus(x)
x <- c("-----TACGCAGTAAAAGCTACTGATG", "CGTCATACGCAGTAAAAACTACTGATG", "CTTCATACGCAGTAAAAACTACTGATG", "CTTCATATGCAGTAAAAACTACTGATG", "CTTCATACGCAGTAAAAACTACTGATG", "CGTCATACGCAGTAAAAGCTACTGATG", "CTTCATATGCAGTAAAAGCTACTGACG") x <- dna(x) seq_consensus(x)
Count the number of matches in sequences
seq_count_pattern(x, pattern)
seq_count_pattern(x, pattern)
x |
a DNA, RNA or AA vector. |
pattern |
a DNA, RNA or AA vectors (but same as |
An integer vector.
It is important to understand how patterns are treated in bioseq.
Patterns are recycled along the sequences (usually the x
argument).
This means that if a pattern (vector or list) is of length > 1, it will be
replicated until it is the same length as x
.
The reverse is not true and a vector of patterns longer than
a vector of sequences will raise a warning.
Patterns can be DNA, RNA or AA vectors (but they must be from the same class as the sequences they are matched against). If patterns are DNA, RNA or AA vectors, they are disambiguated prior to matching. For example pattern dna("ARG") will match AAG or AGG.
Alternatively, patterns can be a simple character vector containing regular expressions.
Vectors of patterns (DNA, RNA, AA or regex) can also be provided in a list. In that case, each vector of the list will be collapsed prior matching, which means that each vector element will be used as an alternative pattern. For example pattern list(c("AAA", "CCC"), "GG") will match AAA or CCC in the first sequence, GG in the second sequence, AAA or CCC in the third, and so on following the recycling rule.
@section Fuzzy matching:
When max_error
is greater than zero, the function perform
fuzzy matching. Fuzzy matching does not support regular expression.
stri_count
from stringi and
str_count
from stringr
for the underlying implementation.
Other string operations:
seq-replace
,
seq_combine()
,
seq_crop_pattern()
,
seq_crop_position()
,
seq_detect_pattern()
,
seq_extract_pattern()
,
seq_extract_position()
,
seq_remove_pattern()
,
seq_remove_position()
,
seq_replace_position()
,
seq_split_kmer()
,
seq_split_pattern()
x <- dna("ACGTTAGTGTAGCCGT", "CTCGAAATGA") seq_count_pattern(x, dna("AAA")) seq_count_pattern(x, "T.G")
x <- dna("ACGTTAGTGTAGCCGT", "CTCGAAATGA") seq_count_pattern(x, dna("AAA")) seq_count_pattern(x, "T.G")
Crop sequences using delimiting patterns
seq_crop_pattern( x, pattern_in, pattern_out, max_error_in = 0, max_error_out = 0, include_patterns = TRUE )
seq_crop_pattern( x, pattern_in, pattern_out, max_error_in = 0, max_error_out = 0, include_patterns = TRUE )
x |
a DNA, RNA or AA vector to be cropped. |
pattern_in |
patterns defining the beginning (left-side). |
pattern_out |
patterns defining the end (right-side). |
max_error_in , max_error_out
|
numeric values ranging from
0 to 1 and giving the maximum error rate allowed between the
target sequence and |
include_patterns |
logical. Should the matched pattern sequence included in the returned sequences? |
A cropped DNA, RNA or AA vector.
Sequences where patterns are not detected returns NA
.
When max_error_in
or max_error_out
are greater
than zero, the function perform fuzzy matching.
Fuzzy matching does not support regular expression.
It is important to understand how patterns are treated in bioseq.
Patterns are recycled along the sequences (usually the x
argument).
This means that if a pattern (vector or list) is of length > 1, it will be
replicated until it is the same length as x
.
The reverse is not true and a vector of patterns longer than
a vector of sequences will raise a warning.
Patterns can be DNA, RNA or AA vectors (but they must be from the same class as the sequences they are matched against). If patterns are DNA, RNA or AA vectors, they are disambiguated prior to matching. For example pattern dna("ARG") will match AAG or AGG.
Alternatively, patterns can be a simple character vector containing regular expressions.
Vectors of patterns (DNA, RNA, AA or regex) can also be provided in a list. In that case, each vector of the list will be collapsed prior matching, which means that each vector element will be used as an alternative pattern. For example pattern list(c("AAA", "CCC"), "GG") will match AAA or CCC in the first sequence, GG in the second sequence, AAA or CCC in the third, and so on following the recycling rule.
@section Fuzzy matching:
When max_error
is greater than zero, the function perform
fuzzy matching. Fuzzy matching does not support regular expression.
stri_extract
from stringi,
str_extract
from stringr and
afind
from stringdist
for the underlying implementation.
Other string operations:
seq-replace
,
seq_combine()
,
seq_count_pattern()
,
seq_crop_position()
,
seq_detect_pattern()
,
seq_extract_pattern()
,
seq_extract_position()
,
seq_remove_pattern()
,
seq_remove_position()
,
seq_replace_position()
,
seq_split_kmer()
,
seq_split_pattern()
x <- dna("ACGTTAAAAAGTGTAGCCCCCGT", "CTCGAAATGA") seq_crop_pattern(x, pattern_in = "AAAA", pattern_out = "CCCC")
x <- dna("ACGTTAAAAAGTGTAGCCCCCGT", "CTCGAAATGA") seq_crop_pattern(x, pattern_in = "AAAA", pattern_out = "CCCC")
Crop sequences between two positions
seq_crop_position(x, position_in = 1, position_out = -1)
seq_crop_position(x, position_in = 1, position_out = -1)
x |
a DNA, RNA or AA vector. |
position_in |
an integer giving the position where to start cropping. |
position_out |
an integer giving the position where to stop cropping. |
A cropped DNA, RNA or AA vector.
stri_sub
from stringi and
str_sub
from stringr
for the underlying implementation.
Other string operations:
seq-replace
,
seq_combine()
,
seq_count_pattern()
,
seq_crop_pattern()
,
seq_detect_pattern()
,
seq_extract_pattern()
,
seq_extract_position()
,
seq_remove_pattern()
,
seq_remove_position()
,
seq_replace_position()
,
seq_split_kmer()
,
seq_split_pattern()
x <- dna("ACGTTAGTGTAGCCGT") # Drop the first 3 nucleotides (ACG) seq_crop_position(x, position_in = 4) # Crop codon between position 4 and 6 seq_crop_position(x, position_in = 4, position_out = 6)
x <- dna("ACGTTAGTGTAGCCGT") # Drop the first 3 nucleotides (ACG) seq_crop_position(x, position_in = 4) # Crop codon between position 4 and 6 seq_crop_position(x, position_in = 4, position_out = 6)
Detect the presence of patterns in sequences
seq_detect_pattern(x, pattern, max_error = 0)
seq_detect_pattern(x, pattern, max_error = 0)
x |
a DNA, RNA or AA vector. |
pattern |
a DNA, RNA or AA vectors (but same as |
max_error |
numeric value ranging from 0 to 1 and giving the maximum error rate allowed between the target sequence and the pattern. Error rate is relative to the length of the pattern. |
A logical vector.
It is important to understand how patterns are treated in bioseq.
Patterns are recycled along the sequences (usually the x
argument).
This means that if a pattern (vector or list) is of length > 1, it will be
replicated until it is the same length as x
.
The reverse is not true and a vector of patterns longer than
a vector of sequences will raise a warning.
Patterns can be DNA, RNA or AA vectors (but they must be from the same class as the sequences they are matched against). If patterns are DNA, RNA or AA vectors, they are disambiguated prior to matching. For example pattern dna("ARG") will match AAG or AGG.
Alternatively, patterns can be a simple character vector containing regular expressions.
Vectors of patterns (DNA, RNA, AA or regex) can also be provided in a list. In that case, each vector of the list will be collapsed prior matching, which means that each vector element will be used as an alternative pattern. For example pattern list(c("AAA", "CCC"), "GG") will match AAA or CCC in the first sequence, GG in the second sequence, AAA or CCC in the third, and so on following the recycling rule.
@section Fuzzy matching:
When max_error
is greater than zero, the function perform
fuzzy matching. Fuzzy matching does not support regular expression.
stri_detect
from stringi,
str_detect
from stringr and
afind
from stringdist
for the underlying implementation.
Other string operations:
seq-replace
,
seq_combine()
,
seq_count_pattern()
,
seq_crop_pattern()
,
seq_crop_position()
,
seq_extract_pattern()
,
seq_extract_position()
,
seq_remove_pattern()
,
seq_remove_position()
,
seq_replace_position()
,
seq_split_kmer()
,
seq_split_pattern()
x <- dna(c("ACGTTAGTGTAGCCGT", "CTCGAAATGA")) seq_detect_pattern(x, dna(c("CCG", "AAA"))) # Regular expression seq_detect_pattern(x, "^A.{2}T") # Fuzzy matching seq_detect_pattern(x, dna("AGG"), max_error = 0.2) # No match. The pattern has three character, the max_error # has to be > 1/3 to allow one character difference. seq_detect_pattern(x, dna("AGG"), max_error = 0.4) # Match
x <- dna(c("ACGTTAGTGTAGCCGT", "CTCGAAATGA")) seq_detect_pattern(x, dna(c("CCG", "AAA"))) # Regular expression seq_detect_pattern(x, "^A.{2}T") # Fuzzy matching seq_detect_pattern(x, dna("AGG"), max_error = 0.2) # No match. The pattern has three character, the max_error # has to be > 1/3 to allow one character difference. seq_detect_pattern(x, dna("AGG"), max_error = 0.4) # Match
This function finds all the combinations of sequences corresponding to a given vector of sequences with ambiguities (IUPAC codes).
seq_disambiguate_IUPAC(x)
seq_disambiguate_IUPAC(x)
x |
a DNA, RNA or AA vector |
A list of DNA, RNA or AA vectors (depending on the input) giving all possible combinations.
Other op-misc:
seq_nchar()
,
seq_nseq()
,
seq_spellout()
,
seq_stat_gc()
,
seq_stat_prop()
x <- dna(c("AYCTGW", "CTTN")) seq_disambiguate_IUPAC(x) y <- seq_transcribe(x) seq_disambiguate_IUPAC(y) z <- aa("YJSNAALNX") z <- seq_translate(y) seq_disambiguate_IUPAC(z)
x <- dna(c("AYCTGW", "CTTN")) seq_disambiguate_IUPAC(x) y <- seq_transcribe(x) seq_disambiguate_IUPAC(y) z <- aa("YJSNAALNX") z <- seq_translate(y) seq_disambiguate_IUPAC(z)
Extract matching patterns from sequences
seq_extract_pattern(x, pattern)
seq_extract_pattern(x, pattern)
x |
a DNA, RNA or AA vector. |
pattern |
a DNA, RNA or AA vectors (but same as |
A list of vectors of same class as x
.
It is important to understand how patterns are treated in bioseq.
Patterns are recycled along the sequences (usually the x
argument).
This means that if a pattern (vector or list) is of length > 1, it will be
replicated until it is the same length as x
.
The reverse is not true and a vector of patterns longer than
a vector of sequences will raise a warning.
Patterns can be DNA, RNA or AA vectors (but they must be from the same class as the sequences they are matched against). If patterns are DNA, RNA or AA vectors, they are disambiguated prior to matching. For example pattern dna("ARG") will match AAG or AGG.
Alternatively, patterns can be a simple character vector containing regular expressions.
Vectors of patterns (DNA, RNA, AA or regex) can also be provided in a list. In that case, each vector of the list will be collapsed prior matching, which means that each vector element will be used as an alternative pattern. For example pattern list(c("AAA", "CCC"), "GG") will match AAA or CCC in the first sequence, GG in the second sequence, AAA or CCC in the third, and so on following the recycling rule.
@section Fuzzy matching:
When max_error
is greater than zero, the function perform
fuzzy matching. Fuzzy matching does not support regular expression.
stri_extract
from stringi and
str_extract
from stringr
for the underlying implementation.
Other string operations:
seq-replace
,
seq_combine()
,
seq_count_pattern()
,
seq_crop_pattern()
,
seq_crop_position()
,
seq_detect_pattern()
,
seq_extract_position()
,
seq_remove_pattern()
,
seq_remove_position()
,
seq_replace_position()
,
seq_split_kmer()
,
seq_split_pattern()
x <- dna("ACGTTAGTGTAGCCGT", "CTCGAAATGA") seq_extract_pattern(x, dna("AAA")) seq_extract_pattern(x, "T.G")
x <- dna("ACGTTAGTGTAGCCGT", "CTCGAAATGA") seq_extract_pattern(x, dna("AAA")) seq_extract_pattern(x, "T.G")
Extract a region between two positions in sequences
seq_extract_position(x, position_in, position_out)
seq_extract_position(x, position_in, position_out)
x |
a DNA, RNA or AA vector. |
position_in |
an integer giving the position where to start to extract. |
position_out |
an integer giving the position where to stop to extract. |
A vector of same class as x
.
stri_extract
from stringi and
str_extract
from stringr
for the underlying implementation.
Other string operations:
seq-replace
,
seq_combine()
,
seq_count_pattern()
,
seq_crop_pattern()
,
seq_crop_position()
,
seq_detect_pattern()
,
seq_extract_pattern()
,
seq_remove_pattern()
,
seq_remove_position()
,
seq_replace_position()
,
seq_split_kmer()
,
seq_split_pattern()
x <- dna("ACGTTAGTGTAGCCGT", "CTCGAAATGA") seq_extract_position(x, 3, 8)
x <- dna("ACGTTAGTGTAGCCGT", "CTCGAAATGA") seq_extract_position(x, 3, 8)
Count the number of character in sequences
seq_nchar(x, gaps = TRUE)
seq_nchar(x, gaps = TRUE)
x |
a DNA, RNA or AA vector. |
gaps |
if |
An integer vector giving the size of each sequence of x
.
Other op-misc:
seq_disambiguate_IUPAC()
,
seq_nseq()
,
seq_spellout()
,
seq_stat_gc()
,
seq_stat_prop()
x <- dna(c("ATGCAGA", "GGR-----","TTGCCTAGKTGAACC")) seq_nchar(x) seq_nchar(x, gaps = FALSE)
x <- dna(c("ATGCAGA", "GGR-----","TTGCCTAGKTGAACC")) seq_nchar(x) seq_nchar(x, gaps = FALSE)
This is an alias for length
.
seq_nseq(x)
seq_nseq(x)
x |
a DNA, RNA or AA vector. |
an integer.
Other op-misc:
seq_disambiguate_IUPAC()
,
seq_nchar()
,
seq_spellout()
,
seq_stat_gc()
,
seq_stat_prop()
Remove matched patterns in sequences
seq_remove_pattern(x, pattern)
seq_remove_pattern(x, pattern)
x |
a DNA, RNA or AA vector. |
pattern |
a DNA, RNA or AA vectors (but same as |
A vector of same class as x
.
It is important to understand how patterns are treated in bioseq.
Patterns are recycled along the sequences (usually the x
argument).
This means that if a pattern (vector or list) is of length > 1, it will be
replicated until it is the same length as x
.
The reverse is not true and a vector of patterns longer than
a vector of sequences will raise a warning.
Patterns can be DNA, RNA or AA vectors (but they must be from the same class as the sequences they are matched against). If patterns are DNA, RNA or AA vectors, they are disambiguated prior to matching. For example pattern dna("ARG") will match AAG or AGG.
Alternatively, patterns can be a simple character vector containing regular expressions.
Vectors of patterns (DNA, RNA, AA or regex) can also be provided in a list. In that case, each vector of the list will be collapsed prior matching, which means that each vector element will be used as an alternative pattern. For example pattern list(c("AAA", "CCC"), "GG") will match AAA or CCC in the first sequence, GG in the second sequence, AAA or CCC in the third, and so on following the recycling rule.
@section Fuzzy matching:
When max_error
is greater than zero, the function perform
fuzzy matching. Fuzzy matching does not support regular expression.
str_remove
from stringr
for the underlying implementation.
Other string operations:
seq-replace
,
seq_combine()
,
seq_count_pattern()
,
seq_crop_pattern()
,
seq_crop_position()
,
seq_detect_pattern()
,
seq_extract_pattern()
,
seq_extract_position()
,
seq_remove_position()
,
seq_replace_position()
,
seq_split_kmer()
,
seq_split_pattern()
x <- dna("ACGTTAGTGTAGCCGT", "CTCGAAATGA") seq_remove_pattern(x, dna("AAA")) seq_remove_pattern(x, "^A.{2}T")
x <- dna("ACGTTAGTGTAGCCGT", "CTCGAAATGA") seq_remove_pattern(x, dna("AAA")) seq_remove_pattern(x, "^A.{2}T")
Remove a region between two positions in sequences.
seq_remove_position(x, position_in, position_out)
seq_remove_position(x, position_in, position_out)
x |
a DNA, RNA or AA vector. |
position_in |
an integer giving the position where to start to remove. |
position_out |
an integer giving the position where to stop to remove. |
A vector of same class as x
.
str_remove
from stringr
for the underlying implementation.
Other string operations:
seq-replace
,
seq_combine()
,
seq_count_pattern()
,
seq_crop_pattern()
,
seq_crop_position()
,
seq_detect_pattern()
,
seq_extract_pattern()
,
seq_extract_position()
,
seq_remove_pattern()
,
seq_replace_position()
,
seq_split_kmer()
,
seq_split_pattern()
x <- dna("ACGTTAGTGTAGCCGT", "CTCGAAATGA") seq_remove_position(x, 2, 6) seq_remove_position(x, 1:2, 3:4)
x <- dna("ACGTTAGTGTAGCCGT", "CTCGAAATGA") seq_remove_position(x, 2, 6) seq_remove_position(x, 1:2, 3:4)
Replace a region between two positions in sequences
seq_replace_position(x, position_in, position_out, replacement)
seq_replace_position(x, position_in, position_out, replacement)
x |
a DNA, RNA or AA vector. |
position_in |
an integer giving the position where to start to replace. |
position_out |
an integer giving the position where to stop to replace. |
replacement |
a vector of replacements. |
A vector of same class as x
.
stri_replace
from stringi and
str_replace
from stringr
for the underlying implementation.
Other string operations:
seq-replace
,
seq_combine()
,
seq_count_pattern()
,
seq_crop_pattern()
,
seq_crop_position()
,
seq_detect_pattern()
,
seq_extract_pattern()
,
seq_extract_position()
,
seq_remove_pattern()
,
seq_remove_position()
,
seq_split_kmer()
,
seq_split_pattern()
x <- dna("ACGTTAGTGTAGCCGT", "CTCGAAATGA") seq_replace_position(x, c(5, 2), 6, "-------")
x <- dna("ACGTTAGTGTAGCCGT", "CTCGAAATGA") seq_replace_position(x, c(5, 2), 6, "-------")
The function perform reverse translation of amino acid sequences. Such operation does not exist in nature but is provided for completeness. Because of codon degeneracy it is expected to produce many ambiguous nucleotides.
seq_rev_translate(x, code = 1)
seq_rev_translate(x, code = 1)
x |
an amino acid sequence ( |
code |
an integer indicating the genetic code to use for reverse translation (default 1 uses the Standard genetic code). See Details. |
Gaps (-) are interpreted as unknown amino acids (X) but can be
removed prior to the translation with the function seq_remove_gap
.
a vector of DNA sequences.
Other biological operations:
rev_complement
,
seq_translate()
,
transcription
x <- dna("ACTTTGGCTAAG") y <- seq_translate(x) z <- seq_rev_translate(y) z # There is a loss of information during the reverse translation all.equal(x, z)
x <- dna("ACTTTGGCTAAG") y <- seq_translate(x) z <- seq_rev_translate(y) z # There is a loss of information during the reverse translation all.equal(x, z)
This function spells out nucleotides and amino acids in sequences.
seq_spellout(x, short = FALSE, collapse = " - ")
seq_spellout(x, short = FALSE, collapse = " - ")
x |
a DNA, RNA or AA vector |
short |
logical. If TRUE, the function will return 3-letters short names for amino acids (ignored for DNA and RNA). |
collapse |
a character vector to separate the results.
Set to |
A character vector if collapse is not NULL
.
A list of character vectors otherwise.
Other op-misc:
seq_disambiguate_IUPAC()
,
seq_nchar()
,
seq_nseq()
,
seq_stat_gc()
,
seq_stat_prop()
x <- dna("ACGT") seq_spellout(x) x <- rna("ACGU") seq_spellout(x) x <- aa("ACGBTX") seq_spellout(x)
x <- dna("ACGT") seq_spellout(x) x <- rna("ACGU") seq_spellout(x) x <- aa("ACGBTX") seq_spellout(x)
Split sequences into k-mers
seq_split_kmer(x, k)
seq_split_kmer(x, k)
x |
A DNA, RNA or AA vector. |
k |
an integer giving the size of the k-mer. |
a list of k-mer vectors of same class as x
.
Other string operations:
seq-replace
,
seq_combine()
,
seq_count_pattern()
,
seq_crop_pattern()
,
seq_crop_position()
,
seq_detect_pattern()
,
seq_extract_pattern()
,
seq_extract_position()
,
seq_remove_pattern()
,
seq_remove_position()
,
seq_replace_position()
,
seq_split_pattern()
x <- dna(a ="ACGTTAGTGTAGCCGT", b = "CTCGAAATGA") seq_split_kmer(x, k = 5)
x <- dna(a ="ACGTTAGTGTAGCCGT", b = "CTCGAAATGA") seq_split_kmer(x, k = 5)
Split sequences
seq_split_pattern(x, pattern)
seq_split_pattern(x, pattern)
x |
a DNA, RNA or AA vector. |
pattern |
a DNA, RNA or AA vectors (but same as |
A list of vectors of same class as x
.
It is important to understand how patterns are treated in bioseq.
Patterns are recycled along the sequences (usually the x
argument).
This means that if a pattern (vector or list) is of length > 1, it will be
replicated until it is the same length as x
.
The reverse is not true and a vector of patterns longer than
a vector of sequences will raise a warning.
Patterns can be DNA, RNA or AA vectors (but they must be from the same class as the sequences they are matched against). If patterns are DNA, RNA or AA vectors, they are disambiguated prior to matching. For example pattern dna("ARG") will match AAG or AGG.
Alternatively, patterns can be a simple character vector containing regular expressions.
Vectors of patterns (DNA, RNA, AA or regex) can also be provided in a list. In that case, each vector of the list will be collapsed prior matching, which means that each vector element will be used as an alternative pattern. For example pattern list(c("AAA", "CCC"), "GG") will match AAA or CCC in the first sequence, GG in the second sequence, AAA or CCC in the third, and so on following the recycling rule.
@section Fuzzy matching:
When max_error
is greater than zero, the function perform
fuzzy matching. Fuzzy matching does not support regular expression.
stri_split
from stringi and
str_split
from stringr
for the underlying implementation.
Other string operations:
seq-replace
,
seq_combine()
,
seq_count_pattern()
,
seq_crop_pattern()
,
seq_crop_position()
,
seq_detect_pattern()
,
seq_extract_pattern()
,
seq_extract_position()
,
seq_remove_pattern()
,
seq_remove_position()
,
seq_replace_position()
,
seq_split_kmer()
x <- dna(a = "ACGTTAGTGTAGCCGT", b = "CTCGAAATGA") seq_split_pattern(x, dna("AAA")) seq_split_pattern(x, "T.G")
x <- dna(a = "ACGTTAGTGTAGCCGT", b = "CTCGAAATGA") seq_split_pattern(x, dna("AAA")) seq_split_pattern(x, "T.G")
Compute G+C content
seq_stat_gc(x)
seq_stat_gc(x)
x |
a DNA or RNA |
Ambiguous characters (other than S and W) are ignored.
A numeric vector of G+C proportions.
Other op-misc:
seq_disambiguate_IUPAC()
,
seq_nchar()
,
seq_nseq()
,
seq_spellout()
,
seq_stat_prop()
x <- dna(c("ATGCAGA", "GGR-----","TTGCCTAGKTGAACC")) seq_stat_gc(x)
x <- dna(c("ATGCAGA", "GGR-----","TTGCCTAGKTGAACC")) seq_stat_gc(x)
Compute proportions for characters
seq_stat_prop(x, gaps = FALSE)
seq_stat_prop(x, gaps = FALSE)
x |
a DNA, RNA or AA vector. |
gaps |
if |
A list of vectors indicating the proportion of characters in each sequence.
Other op-misc:
seq_disambiguate_IUPAC()
,
seq_nchar()
,
seq_nseq()
,
seq_spellout()
,
seq_stat_gc()
x <- dna(c("ATGCAGA", "GGR-----","TTGCCTAGKTGAACC")) seq_stat_prop(x) seq_stat_prop(x, gaps = TRUE)
x <- dna(c("ATGCAGA", "GGR-----","TTGCCTAGKTGAACC")) seq_stat_prop(x) seq_stat_prop(x, gaps = TRUE)
Translate DNA/RNA sequences into amino acids
seq_translate(x, code = 1, codon_frame = 1, codon_init = FALSE)
seq_translate(x, code = 1, codon_frame = 1, codon_init = FALSE)
x |
a vector of DNA (bioseq_dna) or RNA (bioseq_rna). |
code |
an integer indicating the genetic code to use for translation (default 1 uses the Standard genetic code). See Details. |
codon_frame |
an integer giving the nucleotide position where to start translation. |
codon_init |
a logical indicating whether the first codon is evaluated as a possible codon start and translated to methionine. |
Several genetic codes can be used for translation. See genetic-codes to get the list of available genetic codes and their ID number.
Gaps (-) are interpreted as unknown nucleotides (N) but can be
removed prior to the translation with the function seq_remove_gap
.
The function deals with ambiguities on both sides. This means that if ambiguous codons cannot be translated to amino acid, they are translated to the most specific ambiguous amino acids (X in the most extreme case).
An amino acid vector (bioseq_aa
).
Other biological operations:
rev_complement
,
seq_rev_translate()
,
transcription
x <- dna(c("ATGCAGA", "GGR","TTGCCTAGKTGAACC", "AGGNGC", "NNN")) seq_translate(x)
x <- dna(c("ATGCAGA", "GGR","TTGCCTAGKTGAACC", "AGGNGC", "NNN")) seq_translate(x)
Replace matched patterns in sequences
seq_replace_pattern(x, pattern, replacement)
seq_replace_pattern(x, pattern, replacement)
x |
a DNA, RNA or AA vector. |
pattern |
a DNA, RNA or AA vectors (but same as |
replacement |
a vector of replacements. |
A vector of same class as x
.
It is important to understand how patterns are treated in bioseq.
Patterns are recycled along the sequences (usually the x
argument).
This means that if a pattern (vector or list) is of length > 1, it will be
replicated until it is the same length as x
.
The reverse is not true and a vector of patterns longer than
a vector of sequences will raise a warning.
Patterns can be DNA, RNA or AA vectors (but they must be from the same class as the sequences they are matched against). If patterns are DNA, RNA or AA vectors, they are disambiguated prior to matching. For example pattern dna("ARG") will match AAG or AGG.
Alternatively, patterns can be a simple character vector containing regular expressions.
Vectors of patterns (DNA, RNA, AA or regex) can also be provided in a list. In that case, each vector of the list will be collapsed prior matching, which means that each vector element will be used as an alternative pattern. For example pattern list(c("AAA", "CCC"), "GG") will match AAA or CCC in the first sequence, GG in the second sequence, AAA or CCC in the third, and so on following the recycling rule.
@section Fuzzy matching:
When max_error
is greater than zero, the function perform
fuzzy matching. Fuzzy matching does not support regular expression.
stri_replace
from stringi and
str_replace
from stringr
for the underlying implementation.
Other string operations:
seq_combine()
,
seq_count_pattern()
,
seq_crop_pattern()
,
seq_crop_position()
,
seq_detect_pattern()
,
seq_extract_pattern()
,
seq_extract_position()
,
seq_remove_pattern()
,
seq_remove_position()
,
seq_replace_position()
,
seq_split_kmer()
,
seq_split_pattern()
x <- dna("ACGTTAGTGTAGCCGT", "CTCGAAATGA") seq_replace_pattern(x, dna("AAA"), dna("GGGGGG")) seq_replace_pattern(x, "^A.{2}T", "TTTTTT")
x <- dna("ACGTTAGTGTAGCCGT", "CTCGAAATGA") seq_replace_pattern(x, dna("AAA"), dna("GGGGGG")) seq_replace_pattern(x, "^A.{2}T", "TTTTTT")
Transcribe DNA, reverse-transcribe RNA
seq_transcribe(x) seq_rev_transcribe(x)
seq_transcribe(x) seq_rev_transcribe(x)
x |
A vector of DNA for |
A vector of RNA for seq_transcribe
,
a vector of DNA for seq_rev_transcribe
Other biological operations:
rev_complement
,
seq_rev_translate()
,
seq_translate()
Write sequences in FASTA format
write_fasta(x, file, append = FALSE, line_length = 80, block_length = 10)
write_fasta(x, file, append = FALSE, line_length = 80, block_length = 10)
x |
a DNA, RNA or AA vector. |
file |
a path to a file or a connection. |
append |
a logical. If |
line_length |
length (in number of character) of one line
(excluding spaces separating blocks). Use |
block_length |
length (in number of character) of one block.
Use the same value as |
Other input/output operations:
read_fasta()