| Title: | A DNA Reference Library Manager |
|---|---|
| Description: | Reference database manager offering a set of functions to import, organize, clean, filter, audit and export reference genetic data. Provides functions to download sequence data from NCBI GenBank <https://www.ncbi.nlm.nih.gov/genbank/>. Designed as an environment for semi-automatic and assisted construction of reference databases and to improve standardization and repeatability in barcoding and metabarcoding studies. |
| Authors: | Francois Keck [aut, cre, cph] (ORCID: <https://orcid.org/0000-0002-3323-4167>) |
| Maintainer: | Francois Keck <[email protected]> |
| License: | GPL-3 |
| Version: | 0.1.3 |
| Built: | 2026-05-28 07:02:40 UTC |
| Source: | https://github.com/fkeck/refdb |
Internal check for fields
check_fields(x, what = c("source", "id", "taxonomy", "sequence", "marker"))
x |
a reference database (tibble object). |
what |
a vector of fields to be checked. |
Invisible or error.
Functions to set fields for various databases
refdb_set_fields_NCBI(x)
refdb_set_fields_BOLD(x)
refdb_set_fields_PR2(x)
refdb_set_fields_diatbarcode(x)
x |
a reference database. |
The function returns x with updated attributes.
lib <- read.csv(system.file("extdata", "baetidae_bold.csv", package = "refdb"))
refdb_set_fields_BOLD(lib)
Scores for filtering operations
.filter_seq_length(x, gaps)
x |
a reference database |
gaps |
should gaps be included. |
a numeric vector
Download and parse NCBI taxonomy records
get_ncbi_taxonomy(id, verbose = TRUE)
id |
A vector of id for records in the NCBI Taxonomy database. |
verbose |
print information in the console. |
A tibble with each row corresponding to an id and each column to a taxonomic level.
Create a graph representation from a taxonomic classification included in a reference database. For this function to work, taxonomic fields must be set.
igraph_from_taxo(x, cols = NULL)
x |
a reference database (tibble). |
cols |
an optional vector of column names to use a subset of columns. |
An igraph object representing taxonomic relationships.
Parse NCBI XML and make a table
make_ncbi_table(x)
x |
A XML nodeset. |
A tibble.
Taxonomic ranks of the NCBI Taxonomy database
ncbi_taxo_rank()
a vector of ordered ranks
Process coordinate column returned by NCBI
process_geo_ncbi(x, col = "lat_lon")
x |
NCBI dataframe. |
col |
column name containing geographical coordinates. |
NCBI dataframe.
Check for conflicts in sequences
refdb_check_seq_conflict(x, na_omit = TRUE)
x |
a reference database. |
na_omit |
if |
A list of two-column tibbles reporting duplicated sequences with different taxonomy.
lib <- read.csv(system.file("extdata", "ephem.csv", package = "refdb"))
lib <- refdb_set_fields(lib, taxonomy = c(family = "family_name", genus = "genus_name", species = "species_name"), sequence = "DNA_seq", marker = "marker")
refdb_check_seq_conflict(lib)
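The idea behind this check can be sketched in base R: group the records by sequence and report every sequence that is assigned to more than one taxon. The toy data and column names below are hypothetical, and this is only an illustration, not the package's implementation.

```r
# Toy reference database (hypothetical data and column names).
lib <- data.frame(
  species  = c("Baetis rhodani", "Baetis vernus", "Baetis rhodani", "Cloeon dipterum"),
  sequence = c("ACGTAC", "ACGTAC", "ACGTAC", "TTGACA"),
  stringsAsFactors = FALSE
)

# For each distinct sequence, collect the distinct taxa it is assigned to.
taxa_by_seq <- lapply(split(lib$species, lib$sequence), function(s) sort(unique(s)))

# A conflict is a sequence associated with more than one taxon.
conflicts <- taxa_by_seq[lengths(taxa_by_seq) > 1]
print(conflicts)
```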
This function assesses the genetic similarity among sequences within each taxon. It takes user-defined thresholds (one threshold per taxonomic level) to warn about sequences which are markedly different (based on median distance) from the others. Sequences in the reference database must be aligned.
refdb_check_seq_homogeneity(x, levels, min_n_seq = 3)
x |
a reference database (sequences must be aligned). |
levels |
a named vector of genetic similarity thresholds.
Names must correspond to taxonomic levels (taxonomic fields)
and values must be included in the interval [0, 1].
For example, to assess homogeneity at 5 percent (within species) and
10 percent (within genus): c(species = 0.05, genus = 0.1) |
min_n_seq |
the minimum number of sequences for a taxon to be tested. |
For every tested taxonomic level, the algorithm
checks all sequences in every taxon
(for which the total number of sequences is > min_n_seq).
In each taxon, the pairwise distance matrix among all the sequences
belonging to this taxon is computed. A sequence is tagged as suspicious
and returned by the function
if its median genetic distance from the other sequences is higher than
the threshold set by the user (levels argument).
A dataframe reporting suspicious sequences whose median distance to other sequences of the same taxon is greater than the specified threshold. The first column "level_threshold_homogeneity" indicates the lowest taxonomic level for which the threshold has been exceeded and the second column "value_threshold_homogeneity" gives the computed median distance.
lib <- read.csv(system.file("extdata", "homogeneity.csv", package = "refdb"))
lib <- refdb_set_fields_BOLD(lib)
refdb_check_seq_homogeneity(lib, levels = c(species = 0.05, genus = 0.1))
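The median-distance rule described above can be sketched for a single taxon in base R. This is a minimal illustration on hypothetical toy sequences; the distance used here (proportion of differing positions) is an assumption made for the sketch and may differ from the distance computed by the package.

```r
# Toy aligned sequences belonging to one taxon (hypothetical data, 25 bp each).
seqs <- c(
  s1 = paste0(strrep("ACGT", 6), "A"),
  s2 = paste0(strrep("ACGT", 6), "A"),
  s3 = paste0(strrep("ACGT", 6), "T"),          # one mismatch with s1
  s4 = paste0("TTTT", strrep("ACGT", 4), "TTTTA") # clearly divergent
)

# Proportion of differing positions between two aligned sequences.
p_dist <- function(a, b) {
  mean(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

# Pairwise distance matrix among all sequences of the taxon.
d <- sapply(seqs, function(a) sapply(seqs, function(b) p_dist(a, b)))

# Median distance of each sequence to the others (self-distance excluded).
med <- vapply(seq_along(seqs), function(i) median(d[i, -i]), numeric(1))
names(med) <- names(seqs)

# Flag sequences whose median distance exceeds the threshold.
threshold <- 0.05
suspicious <- names(med)[med > threshold]
print(suspicious)
```

Here only `s4` is flagged: the single mismatch in `s3` gives a distance of 0.04 to the other sequences, which stays under the 0.05 threshold.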
Check for conflicts in taxonomy
refdb_check_tax_conflict(x)
x |
a reference database. |
A list of two-column tibbles reporting for each taxonomic level the taxa with identical names but different upstream taxonomy.
lib <- read.csv(system.file("extdata", "ephem.csv", package = "refdb"))
lib <- refdb_set_fields(lib, taxonomy = c(family = "family_name", genus = "genus_name", species = "species_name"), sequence = "DNA_seq", marker = "marker")
refdb_check_tax_conflict(lib)
This function uses the generalized Levenshtein (edit) distance to identify possible issues with taxonomic names.
refdb_check_tax_typo(x, tol = 1)
x |
a reference database. |
tol |
the edit distance below which two taxonomic names are reported. |
A list of two-column tibbles reporting for each taxonomic level
the pairs of taxonomic names sharing the same upstream taxonomy and for
which the generalized Levenshtein (edit) distance is below
the tol value.
lib <- read.csv(system.file("extdata", "ephem.csv", package = "refdb"))
lib <- refdb_set_fields(lib, taxonomy = c(family = "family_name", genus = "genus_name", species = "species_name"), sequence = "DNA_seq", marker = "marker")
refdb_check_tax_typo(lib)
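The edit-distance comparison can be illustrated with `utils::adist`, which computes the generalized Levenshtein distance in base R. The names below are hypothetical, and treating `tol` as an inclusive bound (`<=`) is an assumption of this sketch rather than a statement about the package's behaviour.

```r
# Hypothetical genus names assumed to share the same upstream taxonomy.
genera <- c("Baetis", "Baettis", "Cloeon", "Centroptilum")

# Generalized Levenshtein (edit) distance between all pairs of names.
d <- utils::adist(genera)

# Keep each unordered pair once (upper triangle), skipping identical names.
tol <- 1
idx <- which(d <= tol & d > 0 & upper.tri(d), arr.ind = TRUE)

pairs <- data.frame(
  name_1   = genera[idx[, "row"]],
  name_2   = genera[idx[, "col"]],
  distance = as.numeric(d[idx]),
  stringsAsFactors = FALSE
)
print(pairs)
```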
Crop genetic sequences with a set of primers
refdb_clean_seq_crop_primers(x, primer_forward, primer_reverse, max_error_in = 0.1, max_error_out = 0.1, include_primers = TRUE)
x |
a reference database with a defined sequence field. |
primer_forward |
primer forward. |
primer_reverse |
primer reverse. |
max_error_in, max_error_out |
maximum error for a match (frequency based on primer length). |
include_primers |
a logical indicating whether the detected primers are included in the cropped sequences. |
A reference database.
lib <- read.csv(system.file("extdata", "baetidae_bold.csv", package = "refdb"))
lib <- refdb_set_fields_BOLD(lib)
refdb_clean_seq_crop_primers(lib, "AGT", "TTTA")
Remove gaps from genetic sequences
refdb_clean_seq_remove_gaps(x)
x |
a reference database with a defined sequence field. |
A reference database.
lib <- read.csv(system.file("extdata", "baetidae_bold.csv", package = "refdb"))
lib <- refdb_set_fields_BOLD(lib)
refdb_clean_seq_remove_gaps(lib)
Remove repeated side N from genetic sequences
refdb_clean_seq_remove_sideN(x, side = "both")
x |
a reference database with a defined sequence field. |
side |
which side to clean.
Can be one of |
A reference database.
lib <- read.csv(system.file("extdata", "baetidae_bold.csv", package = "refdb"))
lib <- refdb_set_fields_BOLD(lib)
refdb_clean_seq_remove_sideN(lib)
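What trimming side N means can be shown on plain character strings. In this sketch the helper `trim_side_n` and the values `"left"` and `"right"` for `side` are hypothetical (only `"both"` appears in the usage above); internal N characters are deliberately left untouched.

```r
# Remove runs of N at the start, the end, or both ends of each sequence.
# Helper name and the "left"/"right" labels are assumptions for this sketch.
trim_side_n <- function(x, side = c("both", "left", "right")) {
  side <- match.arg(side)
  if (side %in% c("both", "left"))  x <- sub("^N+", "", x)
  if (side %in% c("both", "right")) x <- sub("N+$", "", x)
  x
}

seqs <- c("NNNACGTNACGNN", "ACGT", "NNACGT")
trimmed <- trim_side_n(seqs)
print(trimmed)
```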
Harmonize taxonomic name nomenclature
refdb_clean_tax_harmonize_nomenclature(x, cols = NULL)
x |
a reference database. |
cols |
an optional vector of column names.
If |
A reference database.
lib <- read.csv(system.file("extdata", "baetidae_bold.csv", package = "refdb"))
lib <- refdb_set_fields_BOLD(lib)
refdb_clean_tax_harmonize_nomenclature(lib)
Convert missing taxonomic names to NA
refdb_clean_tax_NA(x, cols = NULL, hybrid = TRUE, uncertain = FALSE)
x |
a reference database. |
cols |
an optional vector of column names.
If |
hybrid |
hybrids are converted to NA (default |
uncertain |
taxa with qualifiers of uncertainty (cf., aff., etc.)
are converted to NA (default |
A reference database.
lib <- read.csv(system.file("extdata", "baetidae_bold.csv", package = "refdb"))
lib <- refdb_set_fields_BOLD(lib)
refdb_clean_tax_NA(lib)
Remove blank characters from taxonomic names
refdb_clean_tax_remove_blank(x, cols = NULL)
x |
a reference database. |
cols |
an optional vector of column names.
If |
A reference database.
lib <- read.csv(system.file("extdata", "baetidae_bold.csv", package = "refdb"))
lib <- refdb_set_fields_BOLD(lib)
refdb_clean_tax_remove_blank(lib)
Remove extra words from taxonomic names
refdb_clean_tax_remove_extra(x, cols = NULL)
x |
a reference database. |
cols |
an optional vector of column names.
If |
As the function can match words like "g.", "s." or "x", which can be meaningful in some nomenclatures, it is recommended to run refdb_clean_tax_harmonize_nomenclature first.
A reference database.
lib <- read.csv(system.file("extdata", "baetidae_bold.csv", package = "refdb"))
lib <- refdb_set_fields_BOLD(lib)
refdb_clean_tax_remove_extra(lib)
Remove subspecific information from taxonomic names
refdb_clean_tax_remove_subsp(x, cols = NULL)
x |
a reference database. |
cols |
an optional vector of column names.
If |
A reference database.
lib <- read.csv(system.file("extdata", "baetidae_bold.csv", package = "refdb"))
lib <- refdb_set_fields_BOLD(lib)
refdb_clean_tax_remove_subsp(lib)
Remove terms indicating uncertainty in taxonomic names
refdb_clean_tax_remove_uncertainty(x, cols = NULL)
x |
a reference database. |
cols |
an optional vector of column names.
If |
A reference database.
Marks of taxonomic uncertainty provided by specialists are not without value. The consequences of their deletion must be well understood by the user before using this function.
lib <- read.csv(system.file("extdata", "baetidae_bold.csv", package = "refdb"))
lib <- refdb_set_fields_BOLD(lib)
refdb_clean_tax_remove_uncertainty(lib)
Write a reference database in formats which can be used with the functions of the package dada2.
refdb_export_dada2(x, file, mode = "taxonomy")
x |
a reference database. |
file |
a path to the file to be written. |
mode |
character string to determine the type of file to produce.
Use |
No return value, called for side effects.
lib <- read.csv(system.file("extdata", "baetidae_bold.csv", package = "refdb"))
lib <- refdb_set_fields_BOLD(lib)
refdb_export_dada2(lib, tempfile())
Write a reference database in file formats which can be used
to train the IDTAXA classifier implemented in DECIPHER.
refdb_export_idtaxa(x, file, taxid = FALSE)
x |
a reference database. |
file |
a file path without extension. This will be used to create a .fasta file and two .txt files. |
taxid |
should the taxid file be generated (can be very slow with large databases) |
The function generates three files.
- A fasta file containing the sequences with their IDs.
This file must be imported as a DNAStringSet
to be used with DECIPHER, using e.g.:
Biostrings::readDNAStringSet("ex_seqs.fasta")
- A text file containing the taxonomic assignment of the sequences.
This file must be imported as a character vector
to be used with DECIPHER, using e.g.:
readr::read_lines("ex_taxo.txt")
- A text file ("taxid") containing the taxonomic ranks
associated with each taxon. This is an asterisk-delimited file
which must be imported as a dataframe (see LearnTaxa), using e.g.:
readr::read_delim("ex_ranks.txt",
col_names = c('Index', 'Name', 'Parent', 'Level', 'Rank'),
delim = "*", quote = "")
The taxid file can be very slow to write for large datasets. Therefore, it is not generated by default.
No return value, called for side effects.
lib <- read.csv(system.file("extdata", "baetidae_bold.csv", package = "refdb"))
lib <- refdb_set_fields_BOLD(lib)
refdb_export_idtaxa(lib, tempfile())
Write a reference database in formats which can be used
with Mothur.
refdb_export_mothur(x, file)
x |
a reference database. |
file |
a file path. This will be used to create a .fasta file and a .txt file. |
No return value, called for side effects.
lib <- read.csv(system.file("extdata", "baetidae_bold.csv", package = "refdb"))
lib <- refdb_set_fields_BOLD(lib)
refdb_export_mothur(lib, tempfile())
Write a reference database in utax format.
refdb_export_utax(x, file, verbose = TRUE)
x |
a reference database. |
file |
a file path. This will be used to create a .fasta file. |
verbose |
print information in the console. |
No return value, called for side effects.
lib <- read.csv(system.file("extdata", "baetidae_bold.csv", package = "refdb"))
lib <- refdb_set_fields_BOLD(lib)
refdb_export_utax(lib, tempfile())
Replace NA values in taxonomic classification using upstream ranks.
refdb_fill_tax_downstream(x, qualifier = "indet.")
x |
a reference database. |
qualifier |
a string to add the new labels.
Default ensure that |
A reference database.
See also: refdb_fill_tax_upstream to replace NA values using downstream data.
lib <- read.csv(system.file("extdata", "baetidae_bold.csv", package = "refdb"))
lib <- refdb_set_fields_BOLD(lib)
refdb_fill_tax_downstream(lib)
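Filling missing ranks from upstream data can be sketched as follows. The helper `fill_downstream`, the toy table, and the exact labelling scheme (closest known upstream name followed by the qualifier) are assumptions made for illustration; the labels produced by the package may differ. The sketch assumes columns are ordered from the highest to the lowest rank and that the highest rank is never NA.

```r
# Toy taxonomy, columns ordered from highest to lowest rank (hypothetical data).
tax <- data.frame(
  family  = c("Baetidae", "Baetidae", "Baetidae"),
  genus   = c("Baetis", "Baetis", NA),
  species = c("Baetis rhodani", NA, NA),
  stringsAsFactors = FALSE
)

fill_downstream <- function(tax, qualifier = "indet.") {
  known <- tax[[1]]  # closest known upstream name for each record
  for (j in seq_len(ncol(tax))[-1]) {
    missing <- is.na(tax[[j]])
    # Label missing ranks with the closest known upstream name and the qualifier.
    tax[[j]][missing] <- paste(known[missing], qualifier)
    # Where the rank was known, it becomes the new reference for lower ranks.
    known[!missing] <- tax[[j]][!missing]
  }
  tax
}

filled <- fill_downstream(tax)
print(filled)
```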
Replace NA values in taxonomic classification using downstream ranks.
refdb_fill_tax_upstream(x, qualifier = "undef.")
x |
a reference database. |
qualifier |
a string to add the new labels.
Default ensure that |
A reference database.
See also: refdb_fill_tax_downstream to replace terminal NA values using upstream data.
lib <- read.csv(system.file("extdata", "baetidae_bold.csv", package = "refdb"))
lib <- refdb_set_fields_BOLD(lib)
refdb_fill_tax_upstream(lib)
Filter records by taxonomic scope of studies
refdb_filter_ref_scope(x, max_tax)
x |
a reference database (tibble). |
max_tax |
the maximum (widest) taxonomic focus of the study. |
A reference field (one or more columns) must be set to use this function. If the reference is not available (NA) for a record, the record is not dropped.
a reference database (tibble).
lib <- read.csv(system.file("extdata", "baetidae_bold.csv", package = "refdb"))
lib <- refdb_set_fields_BOLD(lib)
lib$refs <- rep("REF_1", nrow(lib))
lib <- refdb_set_fields(lib, reference = "refs")
refdb_filter_ref_scope(lib, max_tax = "family_name")
Filter sequences based on their number of ambiguous characters.
refdb_filter_seq_ambiguous(x, max_ambig = 3L, char = "N")
x |
a reference database. |
max_ambig |
maximum number of ambiguous characters. |
char |
characters interpreted as ambiguous (vector). |
A tibble (filtered reference database).
lib <- read.csv(system.file("extdata", "baetidae_bold.csv", package = "refdb"))
lib <- refdb_set_fields_BOLD(lib)
refdb_filter_seq_ambiguous(lib)
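The score behind this filter is a per-sequence count of ambiguous characters. A minimal base R sketch on hypothetical sequences follows; the helper `count_ambiguous` is not part of the package, and treating `max_ambig` as an inclusive bound is an assumption.

```r
# Count, for each sequence, the characters listed in `char`.
# Helper name is hypothetical.
count_ambiguous <- function(x, char = "N") {
  vapply(strsplit(x, ""), function(s) sum(s %in% char), integer(1))
}

seqs <- c(a = "ACGTNNACGT", b = "ACNNNNGT", c = "ACGTACGT")
n_ambig <- count_ambiguous(seqs)

# Keep sequences with at most `max_ambig` ambiguous characters.
max_ambig <- 3L
kept <- seqs[n_ambig <= max_ambig]
print(kept)
```

Passing a vector such as `char = c("N", "R", "Y")` counts several ambiguity codes at once.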
Exclude duplicated sequences. This is based both on sequences and taxonomy. NA values are assumed to be comparable.
refdb_filter_seq_duplicates(x)
x |
a reference database. |
A tibble (filtered reference database).
lib <- read.csv(system.file("extdata", "baetidae_bold.csv", package = "refdb"))
lib <- refdb_set_fields_BOLD(lib)
refdb_filter_seq_duplicates(lib)
Filter sequences based on their number of repeated characters.
refdb_filter_seq_homopolymers(x, max_len = 16L)
x |
a reference database. |
max_len |
maximum number of repeated characters (homopolymer). |
A tibble (filtered reference database).
lib <- read.csv(system.file("extdata", "baetidae_bold.csv", package = "refdb"))
lib <- refdb_set_fields_BOLD(lib)
refdb_filter_seq_homopolymers(lib)
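The length of the longest homopolymer in a sequence can be computed with run-length encoding. The sketch below uses base R and hypothetical sequences; the helper `longest_run` is not a package function, and the inclusive comparison with `max_len` is an assumption.

```r
# Length of the longest run of identical characters in each sequence.
# Helper name is hypothetical.
longest_run <- function(x) {
  vapply(strsplit(x, ""), function(s) {
    if (length(s) == 0) return(0L)  # empty sequence: no run
    max(rle(s)$lengths)
  }, integer(1))
}

seqs <- c("ACGTTTTTACG", paste0(strrep("A", 18), "CGT"), "ACGT")
runs <- longest_run(seqs)

# Keep sequences whose longest homopolymer does not exceed `max_len`.
max_len <- 16L
kept <- seqs[runs <= max_len]
print(runs)
```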
Filter sequences based on their number of characters.
refdb_filter_seq_length(x, min_len = NULL, max_len = NULL, gaps = FALSE)
x |
a reference database. |
min_len, max_len |
minimum and maximum sequence lengths.
Use |
gaps |
if |
A tibble (filtered reference database).
lib <- read.csv(system.file("extdata", "baetidae_bold.csv", package = "refdb"))
lib <- refdb_set_fields_BOLD(lib)
refdb_filter_seq_length(lib, 50L)
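The effect of the `gaps` argument on the measured length can be shown on plain strings. In this sketch the helper `seq_length` is hypothetical and the gap character is assumed to be `"-"`.

```r
# Sequence length with or without gap characters (assumed to be "-").
# Helper name is hypothetical.
seq_length <- function(x, gaps = FALSE) {
  if (!gaps) x <- gsub("-", "", x, fixed = TRUE)
  nchar(x)
}

seqs <- c("ACGT--ACGT", "AC--", "ACGTACGTACGT")
len_nogap <- seq_length(seqs)             # gaps ignored
len_gap   <- seq_length(seqs, gaps = TRUE) # gaps counted

# Keep sequences with at least `min_len` non-gap characters.
min_len <- 5L
kept <- seqs[len_nogap >= min_len]
print(kept)
```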
Filter sequences based on the presence of primers.
refdb_filter_seq_primer(x, primer_forward = NULL, primer_reverse = NULL, max_error_forward = 0.1, max_error_reverse = 0.1)
x |
a reference database. |
primer_forward |
forward primer. |
primer_reverse |
reverse primer. |
max_error_forward, max_error_reverse |
maximum error for a match (frequency based on primer length). |
A tibble (filtered reference database).
lib <- read.csv(system.file("extdata", "baetidae_bold.csv", package = "refdb"))
lib <- refdb_set_fields_BOLD(lib)
refdb_filter_seq_primer(lib, "ACTA")
Filter sequences based on their number of stop codons.
refdb_filter_seq_stopcodon(x, max_stop = 0, code, codon_frame = NA)
x |
a reference database. |
max_stop |
maximum number of stop codons. |
code |
an integer indicating the genetic code to use for translation (see genetic-codes). |
codon_frame |
an integer giving the nucleotide position where
to start translation. If |
A tibble (filtered reference database).
lib <- read.csv(system.file("extdata", "baetidae_bold.csv", package = "refdb"))
lib <- refdb_set_fields_BOLD(lib)
refdb_filter_seq_stopcodon(lib, code = 5)
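Counting stop codons in a fixed reading frame can be sketched without any translation library. The helper `count_stops` is hypothetical, and the default stop set (TAA and TAG) corresponds to genetic code 5 (invertebrate mitochondrial), where TGA encodes tryptophan. The package itself may rely on a full translation step, so this is only an illustration of the principle.

```r
# Count stop codons in each sequence, reading from position `frame`.
# Helper name is hypothetical; default stops are those of genetic code 5.
count_stops <- function(x, frame = 1L, stops = c("TAA", "TAG")) {
  vapply(x, function(s) {
    s <- substr(s, frame, nchar(s))
    n_codons <- nchar(s) %/% 3
    if (n_codons == 0) return(0L)
    starts <- seq(1, by = 3, length.out = n_codons)
    codons <- substring(s, starts, starts + 2)
    sum(codons %in% stops)
  }, integer(1), USE.NAMES = FALSE)
}

seqs <- c("ATGGCATAAGGC", "ATGGCATGAGGC")
n_stop <- count_stops(seqs)

# Keep sequences with at most `max_stop` stop codons.
max_stop <- 0
kept <- seqs[n_stop <= max_stop]
print(n_stop)
```

The first sequence reads ATG GCA TAA GGC and contains one stop codon; the second contains TGA, which is not a stop under this code.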
Remove records where the taxon is NA if it is not the only representative of the upper clade. Note that the function may be slow on large datasets. //EXPERIMENTAL//
refdb_filter_tax_na(x)
x |
a reference database. |
A tibble (filtered reference database).
lib <- read.csv(system.file("extdata", "baetidae_bold.csv", package = "refdb"))
lib <- refdb_set_fields_BOLD(lib)
refdb_filter_tax_na(lib)
Filter records based on their taxonomic precision.
refdb_filter_tax_precision(x, min_tax)
x |
a reference database. |
min_tax |
minimum taxonomic level (column name of the reference database). |
A tibble (filtered reference database).
lib <- read.csv(system.file("extdata", "baetidae_bold.csv", package = "refdb"))
lib <- refdb_set_fields_BOLD(lib)
refdb_filter_tax_precision(lib, min_tax = "family_name")
Get fields of a reference database
refdb_get_fields(x, silent = FALSE)
x |
a reference database. |
silent |
if |
The list of fields is returned invisibly.
lib <- read.csv(system.file("extdata", "ephem.csv", package = "refdb"))
refdb_get_fields(lib)
This function searches and downloads data from the NCBI Nucleotide database. Additionally, it uses the NCBI Taxonomy database to get the taxonomic classification of the sequences.
refdb_import_NCBI(query, full = FALSE, max_seq_length = 10000, seq_bin = 200, verbose = TRUE, start = 0L)
query |
a character string with the query. |
full |
a logical. If FALSE (the default), only a subset of the most important fields is included in the result. |
max_seq_length |
a numeric giving the maximum length of sequences to retrieve. Useful to exclude complete genomes. |
seq_bin |
number of sequences to download at once. |
verbose |
print information in the console. |
start |
an integer giving the index where to start to download. For debugging purpose mainly. |
This function uses several functions of the rentrez package to interface with the NCBI's EUtils API.
A tibble.
Error in curl::curl_fetch_memory(url, handle = handle) :
transfer closed with outstanding read data remaining
This error seems to appear with long sequences.
You can try to decrease max_seq_length to exclude them.
try(silo_ncbi <- refdb_import_NCBI("Silo COI"))
Merge several reference databases by common fields.
refdb_merge(..., keep = "fields_all")
... |
reference databases (tibbles). |
keep |
determines which columns to keep.
Can be |
Columns are merged only if they are associated with the same field.
The keep argument determines which columns are returned, as follows:
- "fields_all" (the default) returns all the fields
existing in all the reference databases.
- "fields_shared" returns only the fields shared by
all the reference databases.
- "all" returns all the columns of all the databases.
Columns which are not associated with a field are not merged and are prefixed
with the name of the object they originated from.
a merged reference database (tibble).
lib_1 <- read.csv(system.file("extdata", "baetidae_bold.csv", package = "refdb"))
lib_1 <- refdb_set_fields_BOLD(lib_1)
lib_2 <- lib_1
refdb_merge(lib_1, lib_2)
This function generates an interactive map showing the location of the records of a reference database. Note that only records with latitude and longitude data will be displayed.
refdb_plot_map(x)
x |
a reference database. |
An interactive map object from the leaflet package.
lib <- read.csv(system.file("extdata", "baetidae_bold.csv", package = "refdb"))
lib <- refdb_set_fields_BOLD(lib)
lib <- refdb_set_fields(lib, latitude = "lat", longitude = "lon")
refdb_plot_map(lib)
Plot a histogram of sequence lengths
refdb_plot_seqlen_hist(x, remove_gaps = TRUE)
x |
a reference database |
remove_gaps |
a logical (default |
A ggplot object. This means the plot can be further customized using ggplot2 compatible functions.
lib <- read.csv(system.file("extdata", "baetidae_bold.csv", package = "refdb"))
lib <- refdb_set_fields_BOLD(lib)
refdb_plot_seqlen_hist(lib)
Generate a multipanel plot where, for each taxonomic level, a barplot represents the number of records available in the reference database for the most represented taxa.
refdb_plot_tax_barplot(x, show_n = 10)
x |
a reference database. |
show_n |
an integer value indicating the number of taxa to show in each panel. |
A ggplot object.
lib <- read.csv(system.file("extdata", "baetidae_bold.csv", package = "refdb"))
lib <- refdb_set_fields_BOLD(lib)
lib <- refdb_set_fields(lib, latitude = "lat", longitude = "lon")
refdb_plot_tax_barplot(lib)
Represent the hierarchical structure of the taxonomic information of a reference database as a tree.
refdb_plot_tax_tree(x, leaf_col = NULL, color_col = NULL, freq_labels = 0, expand_plot = 0.5)
x |
a reference database. |
leaf_col |
a column name referring to the taxonomic level
for the leaves of the tree. If not provided ( |
color_col |
a column name referring to the taxonomic level
for the color of the leaves (must be higher or equal to the level
of |
freq_labels |
a numeric value to adjust the number of printed labels (minimum frequency). Default is zero which means all non-NA labels are printed. |
expand_plot |
a value to expand the limits of the plot. Useful if the labels are too long. |
The underlying graph is computed using the non-exported function
igraph_from_taxo.
A ggplot2 (ggraph) object. This means the plot can be further customized using ggplot2 compatible functions.
lib <- read.csv(system.file("extdata", "baetidae_bold.csv", package = "refdb"))
lib <- refdb_set_fields_BOLD(lib)
refdb_plot_tax_tree(lib)
Represent the hierarchical structure of the taxonomic information of a reference database as a set of nested rectangles (treemap).
refdb_plot_tax_treemap(x, cols = NULL, freq_labels = c(0.01, 0.003))
x |
a reference database. |
cols |
a vector of column names referring to taxonomic levels
to include in the treemap. If not provided ( |
freq_labels |
a numeric vector of length two to adjust the number of printed labels (see Details). |
Only the columns provided in the The number of labels printed are determined by The underlying graph is computed using the non-exported function
A ggplot2 (ggraph) object. This means the plot can be further customized using ggplot2 compatible functions.
lib <- read.csv(system.file("extdata", "baetidae_bold.csv", package = "refdb"))
lib <- refdb_set_fields_BOLD(lib)
refdb_plot_tax_treemap(lib)
This function produces an HTML report to investigate potential issues in a reference database.
refdb_report(x, file = NULL, view = TRUE)
x |
a reference database. |
file |
the file (path) to write the report. If |
view |
A logical. If |
The function invisibly returns the file where the report was written.
lib <- read.csv(system.file("extdata", "ephem.csv", package = "refdb"))
lib <- refdb_set_fields(lib, taxonomy = c(family = "family_name", genus = "genus_name", species = "species_name"), sequence = "DNA_seq", marker = "marker")
tmp <- tempfile()
refdb_report(lib, tmp, view = FALSE)
This function can be useful to keep a maximum number of records per taxon. It requires the development version of dplyr to work because of slice_sample, and will be exported once available.
refdb_sample_tax(x, n_max = 10, cols = NULL)
x |
a reference database. |
n_max |
maximum number of records to keep for each taxon. |
cols |
an optional vector of column names.
If |
A reference database.
Associate columns to fields so they are recognized and appropriately treated by refdb functions.
refdb_set_fields(x, source = NA, id = NA, organism = NA, taxonomy = NA, sequence = NA, marker = NA, latitude = NA, longitude = NA, reference = NA, config_yaml = NULL)
x |
a reference database (tibble). |
source |
name of the column which contains the data source. |
id |
name of the column which contains the record IDs. |
organism |
name of the column which contains the names of the organisms. |
taxonomy |
a vector of column names. |
sequence |
name of the column which contains the sequences. |
marker |
name of the column which contains marker names. |
latitude |
name of the column which contains latitudes (WGS 84). |
longitude |
name of the column which contains longitudes (WGS 84). |
reference |
a vector of column names. |
config_yaml |
a file path to a YAML file |
Taxonomy reordering. Use NA to ignore a field and NULL to delete it. Fields set using config_yaml always overwrite those set by arguments.
The function returns x with updated attributes.
lib <- read.csv(system.file("extdata", "ephem.csv", package = "refdb"))
lib <- refdb_set_fields(lib, taxonomy = c(family = "family_name", genus = "genus_name", species = "species_name"), sequence = "DNA_seq", marker = "marker")
Replace the current taxonomy using the NCBI Taxonomy database
refdb_set_ncbitax(x, min_level = "species", force_species_name = TRUE, verbose = TRUE)
x |
a reference database (tibble) with one or several columns giving the taxonomy of each record and explicitly indicated in the field taxonomy. See refdb_set_fields. |
min_level |
minimum taxonomic level at which taxonomy
should be replaced. Default is the finest level ( |
force_species_name |
if |
verbose |
print information in the console. |
The reference database with the NCBI taxonomy for the genus level and higher ranks (the original taxonomy above the genus level is removed).
lib <- read.csv(system.file("extdata", "baetidae_bold.csv", package = "refdb"))
lib <- refdb_set_fields_BOLD(lib)
try(refdb_set_ncbitax(lib))
This function can be used to save fields defined
using e.g. refdb_set_fields to a file.
Data are saved in YAML and can be read again using the
config_yaml argument of refdb_set_fields.
refdb_write_fields(x, file)
x |
a reference database with some fields to be saved. |
file |
a path to the file to write. |
No return value, called for its side effects.
lib <- read.csv(system.file("extdata", "ephem.csv", package = "refdb"))
tmp <- tempfile()
refdb_write_fields(lib, tmp)
Ranks considered as valid by refdb
valid_taxo_rank()
a vector of ordered ranks.
This is a simplified version of the
list rank_ref available in taxize.
valid_taxo_rank()
Combine xml_find_first and xml_text to extract elements.
xml_extract(x, xpath)
x |
A document, node, or node set. |
xpath |
A string containing a xpath expression. |
A character vector, the same length as x.