print.xml_nodeset very slow for document with one huge node #366

Closed
@MichaelChirico

Found in the XML representation of an edge case R file:

library(xml2)
library(xmlparsedata)

p = parse("https://raw.githubusercontent.com/mwaldstein/edgarWebR/fb9a38e6a57186ffd1c93cc1aa00c4fdf1bc5514/tests/cache/browse-edgar-11457c.R")
xml = read_xml(xml_parse_data(p))

Printing this is painfully slow:

system.time(print(xml))
# {xml_document}
# <exprlist>
# [1] <expr line1="1" col1="1" line2="5944" col2="43" start="145" end="855979">\n  <expr line1="1" col1="1" line2="1" col2="9" start="145" end="153">\n    <SYMBOL_FUNCTION_CALL li ...
#    user  system elapsed 
#   2.906   0.048   2.958 

I took a brief look, and encodeString() appears to be the culprit:

# ** debugging inside show_nodes() **
system.time(vapply(x, as.character, FUN.VALUE = character(1)))
#    user  system elapsed 
#   0.248   0.017   0.268 
system.time(encodeString(vapply(x, as.character, FUN.VALUE = character(1))))
#    user  system elapsed 
#   2.959   0.024   3.007

Is it possible to apply substr() twice -- once after as.character(), then again after encodeString()?

chr = vapply(x, as.character, FUN.VALUE = character(1))
nchar(chr)
# [1] 18965721

This is clearly already way too wide (width = 180 for me).

I believe we can always just apply

x %>%
  substring(1, n) %>%
  encodeString() %>%
  substring(1, n)

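since the default behavior of encodeString() is simply to escape non-printable characters with a backslash, so its output is never narrower than its input. Truncating to n characters before encoding therefore cannot lose anything that would have survived truncating to n afterwards.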
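
As a rough sketch of the idea in base R (no pipe needed): truncate_encoded() is a hypothetical helper name, not an xml2 function, and big is a synthetic stand-in for the huge node. It assumes encodeString() escapes each character independently and never shortens the string, which is what I'd expect from the default arguments but have not checked against every locale.

```r
# Hypothetical helper (not part of xml2): truncate, escape, truncate again.
truncate_encoded <- function(x, n) {
  # Cut first so encodeString() only ever sees at most n characters.
  short <- substring(x, 1, n)
  # Escaping may widen the string (e.g. a newline becomes two characters),
  # so trim back down to the target width afterwards.
  substring(encodeString(short), 1, n)
}

# Synthetic stand-in for one huge node: many short lines joined together.
big <- paste(rep("<expr line1=\"1\">\n", 1e4), collapse = "")
n <- 80

fast <- truncate_encoded(big, n)            # encodes at most n characters
slow <- substring(encodeString(big), 1, n)  # encodes the whole string first

# Both routes should give the same printed prefix.
identical(fast, slow)
nchar(fast)
```

Wrapping each of the two calls in system.time() on the real document should show whether the encodeString() cost goes away.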

Happy to file a PR if that sounds good.


Labels: feature (a feature request or enhancement)
