This function scrapes a web page for all links (<a> tags) and extracts both the URLs and the link text.

Usage

ScrapLinks(URL, Arrange = c("Link", "Link_text"))

Arguments

URL

Character. The URL of the web page to scrape. This URL is also used to resolve relative links to absolute URLs.

Arrange

Character vector of length 1 or 2. The column(s) to sort the output by, in ascending order. Valid values are "Link" (the URL of each link) and "Link_text" (the text of each link). The default, c("Link", "Link_text"), sorts by URL first and then by link text.

Value

A tibble with two columns: Link_text, containing the text of each link, and Link, containing the absolute URL of each link. By default the tibble is sorted by Link and then by Link_text (see the Arrange argument), and only unique links are included.
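
The steps described above (collect <a> tags, resolve relative links against the page URL, keep unique rows, sort) can be sketched as follows. This is a minimal illustration and not the source code of ScrapLinks(): it assumes the rvest, xml2, dplyr and tibble packages, and the helper name scrape_links_sketch and its sort_by parameter are hypothetical. It takes an already-parsed page so that it runs without network access.

```r
library(rvest)  # html_elements(), html_attr(), html_text2(), minimal_html()
library(dplyr)  # distinct(), arrange(), across(), all_of()

# Hypothetical helper for illustration only; not the ScrapLinks() source.
scrape_links_sketch <- function(page, base_url,
                                sort_by = c("Link", "Link_text")) {
  anchors <- html_elements(page, "a")
  hrefs <- html_attr(anchors, "href")
  keep <- !is.na(hrefs)  # drop <a> tags that have no href attribute

  tibble::tibble(
    Link_text = html_text2(anchors[keep]),
    # Resolve relative links against the page URL
    Link = xml2::url_absolute(hrefs[keep], base = base_url)
  ) |>
    distinct() |>                      # keep only unique rows
    arrange(across(all_of(sort_by)))   # ascending sort on the chosen columns
}

# A small in-memory page stands in for a downloaded one
page <- minimal_html('
  <a href="/docs">Docs</a>
  <a href="https://example.org/about">About</a>
  <a href="/docs">Docs</a>
  <a>No target</a>
')

links <- scrape_links_sketch(page, base_url = "https://example.org/")
```

Passing sort_by = "Link_text" to the sketch sorts on the link text only, which mirrors what Arrange = "Link_text" is documented to do.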

Examples


head(
  ScrapLinks(URL = "https://github.com/BioDT/IASDT.R"))
#> # A tibble: 6 × 2
#>   Link_text                Link                                                
#>   <chr>                    <chr>                                               
#> 1 https://biodt.eu         https://biodt.eu                                    
#> 2 BioDT                    https://biodt.eu/                                   
#> 3 link                     https://biodt.eu/                                   
#> 4 IASDT.R                  https://biodt.github.io/IASDT.R                     
#> 5 biodt.github.io/IASDT.R/ https://biodt.github.io/IASDT.R/                    
#> 6 here                     https://biodt.github.io/IASDT.R/reference/index.html

head(
  ScrapLinks(
    URL = "https://github.com/BioDT/IASDT.R", Arrange = "Link_text"))
#> # A tibble: 6 × 2
#>   Link_text     Link                                                            
#>   <chr>         <chr>                                                           
#> 1 + 1 release   https:/github.com/BioDT/IASDT.R/BioDT/IASDT.R/releases          
#> 2 .Rbuildignore https:/github.com/BioDT/IASDT.R/BioDT/IASDT.R/blob/main/.Rbuild…
#> 3 .github       https:/github.com/BioDT/IASDT.R/BioDT/IASDT.R/tree/main/.github 
#> 4 .gitignore    https:/github.com/BioDT/IASDT.R/BioDT/IASDT.R/blob/main/.gitign…
#> 5 .lintr        https:/github.com/BioDT/IASDT.R/BioDT/IASDT.R/blob/main/.lintr  
#> 6 0 forks       https:/github.com/BioDT/IASDT.R/BioDT/IASDT.R/forks             

# This will give an "Invalid URL" error
if (FALSE) { # \dontrun{
 ScrapLinks(URL = "https://github50.com")
} # }
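
When scraping many pages, it may be preferable to catch such an error instead of stopping. The sketch below shows one way to do that with base R's tryCatch(). The wrapper name safe_scrape and the stand-in function failing_scraper are hypothetical and exist only so that the example runs offline; in practice the scraper argument would presumably be ScrapLinks itself.

```r
# Hypothetical wrapper: returns NULL (with a warning) when the scraper fails.
safe_scrape <- function(url, scraper) {
  tryCatch(
    scraper(url),
    error = function(e) {
      warning(
        "Could not scrape ", url, ": ", conditionMessage(e),
        call. = FALSE)
      NULL
    }
  )
}

# Stand-in for a scraper that rejects the URL (assumption, for illustration)
failing_scraper <- function(URL) stop("Invalid URL")

result <- suppressWarnings(
  safe_scrape("https://github50.com", failing_scraper))
is.null(result)
```

Returning NULL keeps the caller in control: failed pages can be filtered out before the remaining results are combined.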