xml_url does not work (at least in combination with base_url) #300

Closed
@mwaldstein

The behavior of xml_url appears to have changed pretty drastically in v1.3.0.

Previously, passing base_url to read_html set the URL returned by xml_url.

These examples are heavily simplified, but they highlight the change in behavior. I can work around the change from "NA" to "<CHARSXP: NA>", but losing the URL from xml_url is a major problem that will require some re-architecting.

This breaks edgarWebR (currently off CRAN due to vignettes making remote API calls)

Using string input

Example 1

require(xml2)
doc <- read_html("<html/>", base_url = "http://test.com")
xml_url(doc)

On v1.2.5 the output was "http://test.com"
On v1.3.1 the output is "UTF-8"

Example 2

require(xml2)
doc <- read_html("<html/>")
xml_url(doc)

On v1.2.5 the output was "NA"
On v1.3.0 the output is "UTF-8"

Using httr response

Example 3

require(xml2)
require(httr)
href <- "https://www.sec.gov/cgi-bin/cik_lookup?company=cloudera"
res <- GET(href)
doc <- read_html(res, base_url = href)
xml_url(doc)

On v1.2.5 the output was "https://www.sec.gov/cgi-bin/cik_lookup?company=cloudera"
On v1.3.0 the output is "<CHARSXP: NA>"

Example 4

require(xml2)
require(httr)
href <- "https://www.sec.gov/cgi-bin/cik_lookup?company=cloudera"
res <- GET(href)
doc <- read_html(res)
xml_url(doc)

On v1.2.5 the output was "NA"
On v1.3.0 the output is "<CHARSXP: NA>"
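
A possible stopgap while xml_url behaves this way, offered only as a minimal sketch and not as a confirmed fix: record the URL on the document at parse time and read it back from there instead of calling xml_url. The helper names read_html_with_url and doc_url and the attribute name "source_url" are hypothetical and are not part of xml2.

```r
library(xml2)

# Hypothetical helper (not part of xml2): parse HTML and remember the URL
# ourselves instead of relying on xml_url().
read_html_with_url <- function(x, base_url = NULL, ...) {
  doc <- if (is.null(base_url)) {
    read_html(x, ...)
  } else {
    read_html(x, base_url = base_url, ...)
  }
  # Store the URL as a plain R attribute on the returned document object.
  attr(doc, "source_url") <- if (is.null(base_url)) NA_character_ else base_url
  doc
}

# Hypothetical accessor: return the remembered URL, or NA if none was recorded.
doc_url <- function(doc) {
  url <- attr(doc, "source_url", exact = TRUE)
  if (is.null(url)) NA_character_ else url
}

doc <- read_html_with_url("<html/>", base_url = "http://test.com")
doc_url(doc)
```

The attribute lives only on the object returned by the helper, so nodes later extracted from the document (for example with xml_find_all) would not carry it. That limitation is why this is a stopgap and not a replacement for a working xml_url.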
