R is a language and environment for statistical computing and graphics that allows repetitive tasks to be automated and improves the reproducibility of science. R is available as Free Software, in source code form, under the terms of the Free Software Foundation’s GNU General Public License.

This simple tutorial shows how to use the rgbif R package to download, clean and visualize biodiversity occurrences from the GBIF data portal. GBIF workshop slides can be found here.

Starting session

First, we install and load the R packages we will use.

# Install packages (if needed)
#install.packages("rgbif")
#install.packages("ggplot2")
#install.packages("maps")

# Load packages in R environment
library(rgbif)
library(ggplot2)
library(maps)
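
As an alternative to uncommenting the install lines by hand, the following minimal sketch checks which of the three packages are missing. The helper name `missing_pkgs` is hypothetical and not part of any package; the install call is left commented, as above.

```r
# Packages used in this tutorial
pkgs <- c("rgbif", "ggplot2", "maps")

# Hypothetical helper: return the packages from `pkgs` that are not installed
missing_pkgs <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_pkgs(pkgs)
to_install

# Uncomment to install only what is missing
# if (length(to_install) > 0) install.packages(to_install)
```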



Selecting taxon names

The rgbif R package includes functions for looking up scientific names of species or other taxa in the GBIF Backbone Taxonomy. This is important when we don’t know the exact scientific name of a species, or when we don’t know whether its taxonomy has changed in recent years. It is also important to check which name is accepted in the GBIF Backbone Taxonomy. To do that we can use the name_backbone function.

Let’s take the American mink, Neovison vison, as an example:

nBb <- name_backbone(name="Neovison vison", rank="species")
nBb
## # A tibble: 1 × 25
##   usageKey acceptedUsageKey scientificName canonicalName rank  status confidence
## *    <int>            <int> <chr>          <chr>         <chr> <chr>       <int>
## 1  2433652          5218823 Neovison viso… Neovison vis… SPEC… SYNON…         98
## # … with 18 more variables: matchType <chr>, kingdom <chr>, phylum <chr>,
## #   order <chr>, family <chr>, genus <chr>, species <chr>, kingdomKey <int>,
## #   phylumKey <int>, classKey <int>, orderKey <int>, familyKey <int>,
## #   genusKey <int>, speciesKey <int>, synonym <lgl>, class <chr>,
## #   verbatim_name <chr>, verbatim_rank <chr>

This table says that, even though Neovison vison is the species name currently accepted by the scientific community, it is not the accepted name in the GBIF Backbone Taxonomy (see SYNON... in the status field). Thus, we should look up the list of synonyms of the accepted name in the GBIF Backbone Taxonomy. To do that, we take the acceptedUsageKey number and pass it to the function name_usage.

# Accepted name in the GBIF Backbone Taxonomy
nBb$species
## [1] "Mustela vison"
nBb$acceptedUsageKey
## [1] 5218823
# Here we use the acceptedUsageKey to look for its synonyms
syn <- name_usage(key = 5218823, data = "synonyms")
syn
## Records returned [3] 
## Args [offset=0, limit=100] 
## # A tibble: 3 × 41
##       key  nameKey taxonID      sourceTaxonKey kingdom phylum order family genus
##     <int>    <int> <chr>                 <int> <chr>   <chr>  <chr> <chr>  <chr>
## 1 9309757 16715214 gbif:9309757      137374031 Animal… Chord… Carn… Muste… Must…
## 2 2433652  7480942 gbif:2433652      172723039 Animal… Chord… Carn… Muste… Must…
## 3 9458324 16715212 gbif:9458324      121517542 Animal… Chord… Carn… Muste… Must…
## # … with 32 more variables: species <chr>, kingdomKey <int>, phylumKey <int>,
## #   classKey <int>, orderKey <int>, familyKey <int>, genusKey <int>,
## #   speciesKey <int>, datasetKey <chr>, constituentKey <chr>, parentKey <int>,
## #   parent <chr>, acceptedKey <int>, accepted <chr>, basionymKey <int>,
## #   basionym <chr>, scientificName <chr>, canonicalName <chr>,
## #   authorship <chr>, nameType <chr>, rank <chr>, origin <chr>,
## #   taxonomicStatus <chr>, nomenclaturalStatus <lgl>, remarks <chr>, …
syn$data$scientificName
## [1] "Lutreola vison (Schreber, 1777)" "Neovison vison (Schreber, 1777)"
## [3] "Putorius vison (Schreber, 1777)"
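
The key 5218823 is typed by hand in the code above. A minimal sketch of how it could be taken from the name_backbone result instead is shown here, using mock one-row tables so that it runs without a connection. The helper `accepted_key` is hypothetical, and it assumes that acceptedUsageKey is absent (or NA) when the matched name is already the accepted one.

```r
# Hypothetical helper: return the accepted key for a name_backbone() result
accepted_key <- function(bb) {
  key <- bb$acceptedUsageKey
  # Assumption: accepted names carry no acceptedUsageKey, so fall back to usageKey
  if (is.null(key) || is.na(key)) bb$usageKey else key
}

# Mock rows holding only the two key columns shown in the output above
synonym_row  <- data.frame(usageKey = 2433652L, acceptedUsageKey = 5218823L)
accepted_row <- data.frame(usageKey = 5218823L)

accepted_key(synonym_row)
accepted_key(accepted_row)
```

With a live session, `accepted_key(nBb)` could then replace the literal key in the call to name_usage.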

We see that there are three synonyms of Mustela vison in the GBIF Backbone Taxonomy database: Lutreola vison, Neovison vison and Putorius vison. We can now check the number of occurrences assigned to each name.

# Mustela vison
occ_count(5218823)
## [1] 52772
# Neovison vison
occ_count(2433652)
## [1] 40828
# Lutreola vison
occ_count(9309757)
## [1] 0
# Putorius vison
occ_count(9458324)
## [1] 3

In any case, we should use the accepted name in the GBIF Backbone Taxonomy, since its count includes the occurrences of all its synonyms. As we can see, the name with the largest number of occurrences is Mustela vison, so we will use its key (5218823) for occurrence searches.
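
The comparison can be made explicit with the counts reported above. This is only a sketch on numbers copied from the occ_count output; the last line assumes, as stated in the text, that the count for the accepted key includes its synonyms.

```r
# Counts copied from the occ_count() output above
counts <- c("Mustela vison"  = 52772,
            "Neovison vison" = 40828,
            "Lutreola vison" = 0,
            "Putorius vison" = 3)

# Names ordered by number of occurrences
sort(counts, decreasing = TRUE)

# Name with the largest number of occurrences
names(which.max(counts))

# Records under the accepted key that do not come from the three synonym keys
counts[["Mustela vison"]] - sum(counts[-1])
```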

Simple queries

As we already know, GBIF uses Darwin Core as the standard for biodiversity data storage. This means that if we don’t specify the fields we are interested in, we obtain a table with 84 fields (columns). Click here for the list with all available fields in Darwin Core databases, and here for the required or strongly recommended fields in GBIF.

# Search with all Darwin Core fields
# We use limit to get only 10 records
occNeo <- occ_search(taxonKey = 5218823, limit = 10)
# Check the dimensions
dim(occNeo$data)
## [1] 10 84
# Here we select only the fields we are interested in
selFields <- c("key","scientificName", "decimalLatitude","decimalLongitude",
               "issues", "country", "basisOfRecord","year")
# We use limit to get only 10 records
occNeo <- occ_search(taxonKey = 5218823, fields = selFields, limit = 10)
# Check the dimensions
dim(occNeo$data)
## [1] 10  8

We will use just those fields for the rest of this tutorial. It is important to mention that the occ_search function does not provide a DOI for our search. We’ll see later how to obtain a DOI for our searches using the rgbif package.
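
When the same field list is applied to a table obtained elsewhere (for example, a full download), some of the requested columns may not be present. A minimal sketch of a defensive selection follows; `keep_fields` is a hypothetical helper and the mock table stands in for `occNeo$data`.

```r
selFields <- c("key", "scientificName", "decimalLatitude", "decimalLongitude",
               "issues", "country", "basisOfRecord", "year")

# Hypothetical helper: keep the requested columns that exist, in the requested order
keep_fields <- function(df, fields) {
  present <- intersect(fields, names(df))
  absent <- setdiff(fields, names(df))
  if (length(absent) > 0) {
    message("Fields not found: ", paste(absent, collapse = ", "))
  }
  df[, present, drop = FALSE]
}

# Mock data standing in for occNeo$data (values are made up)
mock <- data.frame(key = 1:2, year = c(2019, 2020), extra = c("a", "b"))
keep_fields(mock, selFields)
```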

Advanced queries and filtering

We can add different filters to refine our searches. For a complete list of available filters you can use help(occ_search). Now we’ll apply some that can be very useful to reduce the number of records.

Actually, we have already applied a first filter, limit, to cap the number of occurrences returned. The function occ_search allows a maximum of 100,000 occurrences, so if our taxon has more records we need a different approach (we’ll come back to this later).
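
The 100,000-record limit suggests a simple decision rule based on the occurrence count. The sketch below is only illustrative: `choose_method` is a hypothetical helper, and the second count is made up.

```r
# Limit stated in the text for occ_search()
max_search <- 100000

# Hypothetical helper: pick a function name from a record count
choose_method <- function(n_records, limit = max_search) {
  if (n_records > limit) "occ_download" else "occ_search"
}

choose_method(52772)   # count reported above for Mustela vison
choose_method(250000)  # made-up count for a more abundant taxon
```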

Another useful filter is the country. We can pass one or more country codes to restrict the search to those countries.

# Look for occurrences in Spain and Italy
occNeo <- occ_search(taxonKey = 5218823, fields = selFields, country = c("ES","IT"))
summary(occNeo)
##    Length Class  Mode
## ES 5      -none- list
## IT 5      -none- list
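
As the summary shows, the result here is a list with one element per country rather than a single table. A minimal sketch of stacking them into one table follows. It uses a mock object with the same shape (the values are made up), and `bind_countries` and the `queryCountry` column are hypothetical names.

```r
# Mock object shaped like the result above: one element per country,
# each holding a `data` table
occMock <- list(
  ES = list(data = data.frame(key = c(1, 2), year = c(2018, 2020))),
  IT = list(data = data.frame(key = 3, year = 2019))
)

# Hypothetical helper: stack the per-country tables and record the query country
bind_countries <- function(res) {
  tables <- lapply(names(res), function(cc) {
    d <- res[[cc]]$data
    if (is.null(d) || nrow(d) == 0) return(NULL)  # country without records
    d$queryCountry <- cc
    d
  })
  do.call(rbind, tables)
}

occAll <- bind_countries(occMock)
table(occAll$queryCountry)
```

Note that rbind expects every table to have the same columns, so with real results it may be necessary to select a common set of fields first.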

We can also delimit a geographical window for occurrence filtering.

# Look for occurrences in a specific geographic window
occNeo <- occ_search(taxonKey = 5218823, fields = selFields, 
                     decimalLongitude = "1, 8", decimalLatitude = "48, 57")

# We can now map the region with the obtained occurrences
map("world", xlim = range(occNeo$data$decimalLongitude), ylim = range(occNeo$data$decimalLatitude))  
points(occNeo$data[ , c("decimalLongitude", "decimalLatitude")], pch = 19, col = "darkred", cex=0.5)
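
The range strings passed to decimalLongitude and decimalLatitude can also be built from numeric bounds, which helps when the window comes from other data. This is a small sketch; `range_string` is a hypothetical helper that reproduces the "min, max" format used above.

```r
# Hypothetical helper: build the "min, max" range string used above
range_string <- function(lower, upper) {
  stopifnot(is.numeric(lower), is.numeric(upper), lower <= upper)
  paste(lower, upper, sep = ", ")
}

lonRange <- range_string(1, 8)
latRange <- range_string(48, 57)
lonRange
latRange
```

These strings could then be passed as `decimalLongitude = lonRange, decimalLatitude = latRange`.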

Many other filters, such as the continent or the collection year, can be applied. For a complete list, please check help(occ_search). We can then store the data on our computer as a CSV file.

write.csv(occNeo$data, "Neovison_vison.csv")
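
By default write.csv also writes the row names as an extra first column. The sketch below, on a mock table with made-up values, shows a round trip with `row.names = FALSE` so that the file read back has the same columns as the original. A temporary directory is used here only to keep the example self-contained.

```r
# Mock table standing in for occNeo$data (values are made up)
mock <- data.frame(key = c(101, 102),
                   decimalLongitude = c(2.5, 4.1),
                   decimalLatitude = c(50.2, 52.3))

csvFile <- file.path(tempdir(), "Neovison_vison.csv")

# Write without the row-name column, then read the file back
write.csv(mock, csvFile, row.names = FALSE)
occBack <- read.csv(csvFile)

names(occBack)
```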



Submitting heavy queries

As we said, the function occ_search only allows us to download up to 100,000 records from the GBIF database. If our species of interest has more records, we should use occ_download. This function works in a similar way to queries in the GBIF web portal. It requires logging in with a GBIF account: user name, password and e-mail.

user <- "irec"
pwd <- "******"
email <- "irec.aplicaciones@gmail.com"
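
To avoid writing credentials in a script, they can be kept in environment variables. The sketch below assumes the variable names GBIF_USER, GBIF_PWD and GBIF_EMAIL, which rgbif documents for its download functions. `get_env` is a hypothetical helper, and the values set here are placeholders for illustration only.

```r
# Hypothetical helper: read a required environment variable or stop with a clear message
get_env <- function(name) {
  value <- Sys.getenv(name, unset = "")
  if (!nzchar(value)) stop("Environment variable not set: ", name)
  value
}

# For illustration only: these would normally be set outside the script,
# for example in the user's .Renviron file
Sys.setenv(GBIF_USER = "myUser", GBIF_PWD = "myPassword",
           GBIF_EMAIL = "me@example.org")

user <- get_env("GBIF_USER")
pwd <- get_env("GBIF_PWD")
email <- get_env("GBIF_EMAIL")
```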

The function occ_download submits a query to the GBIF portal. Once the query has been processed, the data are ready for downloading and cleaning. Filters in occ_download are applied with the function pred. Check help(pred) for more information about the usage of filters in occ_download.

# Submit a query
occDlNeo <- occ_download(pred("taxonKey", 5218823), 
                         pred("hasCoordinate", TRUE),
                         pred("continent", "europe"), 
                         user = user, email = email, pwd = pwd)
# Check the query data
occDlNeo
## <<gbif download>>
##   Your download is being processed by GBIF:
##   https://www.gbif.org/occurrence/download/0146544-210914110416597
##   Most downloads finish within 15 min.
##   Check status with
##   occ_download_wait('0146544-210914110416597')
##   After it finishes, use
##   d <- occ_download_get('0146544-210914110416597') %>%
##     occ_download_import()
##   to retrieve your download.
## Download Info:
##   Username: irec
##   E-mail: irec.aplicaciones@gmail.com
##   Format: DWCA
##   Download key: 0146544-210914110416597
##   Created: 2022-02-18T08:06:30.631+00:00
## Citation Info:  
##   Please always cite the download DOI when using this data.
##   https://www.gbif.org/citation-guidelines
##   DOI: 10.15468/dl.83btce
##   Citation:
##   GBIF Occurrence Download https://doi.org/10.15468/dl.83btce Accessed from R via rgbif (https://github.com/ropensci/rgbif) on 2022-02-18
# Check the query status
occ_download_meta(occDlNeo)
## <<gbif download metadata>>
##   Status: SUCCEEDED
##   DOI: 10.15468/dl.83btce
##   Format: DWCA
##   Download key: 0146544-210914110416597
##   Created: 2022-02-18T08:06:30.631+00:00
##   Modified: 2022-02-18T08:07:35.734+00:00
##   Download link: https://api.gbif.org/v1/occurrence/download/request/0146544-210914110416597.zip
##   Total records: 11448

It is important to highlight that occ_download provides a DOI for our query. As we already know, it is useful for data citation and tracking.

To visualize and clean the data it is necessary to download a ZIP file to our local machine.

# "path" should point to a directory on our local machine
zip <- occ_download_get(key = occ_download_meta(occDlNeo)$key, 
                        path="/home/myPath", overwrite = TRUE)
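
The download may fail if the target directory does not exist, so it can be worth creating it beforehand. This is a minimal sketch; the directory name is made up, and a temporary location is used only to keep the example self-contained.

```r
# Made-up directory for the ZIP files
dlDir <- file.path(tempdir(), "gbif_downloads")

# Create the directory (and any missing parents) if it is not there yet
if (!dir.exists(dlDir)) dir.create(dlDir, recursive = TRUE)

dir.exists(dlDir)
```

`dlDir` could then be passed as the `path` argument of occ_download_get.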



Now, we can check for data citation, clean the occurrences, etc.

# Check number of data sets in our downloaded data
length(gbif_citation(zip)$datasets)
## [1] 33
# Check the first three data sets and how to cite them
gbif_citation(zip)$datasets[1:3]
## [[1]]
## <<rgbif citation>>
##    Citation: Haas F, Green M, Jönsson A (2021). Swedish Bird Survey: Swedish
##         coastal bird monitoring programme (Nationella kustfågelövervakningen).
##         Version 1.3. Department of Biology, Lund University. Sampling event
##         dataset https://doi.org/10.15468/sg5spw accessed via GBIF.org on
##         2022-02-18.. Accessed from R via rgbif
##         (https://github.com/ropensci/rgbif) on 2022-02-18 Citation: HELCOM
##         (2018). Abundance of waterbirds in the breeding season. HELCOM core
##         indicator report. See
##         https://helcom.fi/media/core%20indicators/Abundance-of-waterbirds-in-the-breeding-season-HELCOM-core-indicator-2018.pdf.
##         Accessed from R via rgbif (https://github.com/ropensci/rgbif) on
##         2022-02-18
##    Rights: To the extent possible under law, the publisher has waived all
##         rights to these data and has dedicated them to the Public Domain (CC0
##         1.0). Users may copy, modify, distribute and use the work, including
##         for commercial purposes, without restriction.
## 
## [[2]]
## <<rgbif citation>>
##    Citation: John Lucey. River Biologists' Database (EPA). National
##         Biodiversity Data Centre. Occurrence dataset
##         https://doi.org/10.15468/jl6rlf accessed via GBIF.org on 2022-02-18..
##         Accessed from R via rgbif (https://github.com/ropensci/rgbif) on
##         2022-02-18
##    Rights: This work is licensed under a Creative Commons Attribution (CC-BY)
##         4.0 License.
## 
## [[3]]
## <<rgbif citation>>
##    Citation: Citizen Science Observation Dataset B, Tiago P (2020).
##         Biodiversity4all Research-Grade Observations. BioDiversity4All.
##         Occurrence dataset https://doi.org/10.15468/njmmp7 accessed via
##         GBIF.org on 2022-02-18.. Accessed from R via rgbif
##         (https://github.com/ropensci/rgbif) on 2022-02-18
##    Rights: This work is licensed under a Creative Commons Attribution Non
##         Commercial (CC-BY-NC) 4.0 License.
# Import the data in the R session
occNeoImp <- occ_download_import(zip)

# Since we didn't select any field, the downloaded data contain all the Darwin
# Core fields, so we can clean the data and select only those fields we are 
# interested in.
# colnames(occNeoImp)

occNeoCoords <- occNeoImp[ ,c("scientificName", "decimalLongitude", 
                              "decimalLatitude", "issue")]
head(occNeoCoords)
## # A tibble: 6 × 4
##   scientificName                  decimalLongitude decimalLatitude issue        
##   <chr>                                      <dbl>           <dbl> <chr>        
## 1 Neovison vison (Schreber, 1777)            11.3             60.8 "GEODETIC_DA…
## 2 Neovison vison (Schreber, 1777)             5.78            58.8 "GEODETIC_DA…
## 3 Neovison vison (Schreber, 1777)            14.7             58.3 ""           
## 4 Neovison vison (Schreber, 1777)            11.6             57.7 ""           
## 5 Neovison vison (Schreber, 1777)            11.8             57.7 ""           
## 6 Neovison vison (Schreber, 1777)            18.3             59.2 ""
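
A possible next cleaning step is to drop records without coordinates or with unwanted issue flags. The sketch below runs on a mock table with made-up values. `clean_occ` is a hypothetical helper, the flag names are examples of GBIF issue codes chosen for illustration, and it assumes that several flags in the issue column are separated by semicolons.

```r
# Mock rows standing in for occNeoCoords (values and flags are made up)
occMock <- data.frame(
  scientificName = rep("Neovison vison (Schreber, 1777)", 4),
  decimalLongitude = c(11.3, 0, NA, 14.7),
  decimalLatitude = c(60.8, 0, 58.3, 58.3),
  issue = c("COORDINATE_ROUNDED", "ZERO_COORDINATE;COORDINATE_ROUNDED", "", ""),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop rows without coordinates or with an unwanted flag
clean_occ <- function(df, bad_issues) {
  has_coords <- !is.na(df$decimalLongitude) & !is.na(df$decimalLatitude)
  # Assumption: multiple flags are separated by ";"
  flags <- strsplit(df$issue, ";", fixed = TRUE)
  flagged <- vapply(flags, function(f) any(f %in% bad_issues), logical(1))
  df[has_coords & !flagged, , drop = FALSE]
}

occClean <- clean_occ(occMock,
                      bad_issues = c("ZERO_COORDINATE",
                                     "COUNTRY_COORDINATE_MISMATCH"))
nrow(occClean)
```

Which flags count as unwanted depends on the analysis; the two used here are only an example.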