Skip to content

Added OriginalMarkdown Property #2737

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 11 commits into from
Closed

Added OriginalMarkdown Property #2737

wants to merge 11 commits into from

Conversation

joejoe87577
Copy link

@joejoe87577 joejoe87577 commented Jan 10, 2019

Added logic for OriginalMarkdown in different block types

Issue: #2729

PR Type

What kind of change does this PR introduce?
Feature

What is the current behavior?

As described in #2729

What is the new behavior?

Added new property OriginalMarkdown to MarkdownBlock and added logic to extract the original markdown from the parsed string in all block types.

PR Checklist

Please check if your PR fulfills the following requirements:

  • Tested code with current supported SDKs
  • Sample in sample app has been added / updated (for bug fixes / features)
  • Pull Request has been submitted to the documentation repository instructions. Link:
  • Tests for the changes have been added (for bug fixes / features) (if applicable)
  • Header has been added to all new source files (run build/UpdateHeaders.bat)
  • Contains NO breaking changes

Other information

Added logic for OriginalMarkdown in different block types
@dnfclas
Copy link

dnfclas commented Jan 10, 2019

CLA assistant check
All CLA requirements met.

@dnfclas
Copy link

dnfclas commented Jan 10, 2019

CLA assistant check
Thank you for your submission, we really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

❌ joejoe87577 sign now
You have signed the CLA already but the status is still pending? Let us recheck it.

@nmetulev
Copy link
Contributor

Looks good to me. The only question/concern I have is the additional memory used by the new property. Is there a way to use a reference to a substring instead, using span maybe?

@joejoe87577
Copy link
Author

@nmetulev I've not yet used Span<T> so far. Please correct me if I'm wrong here.
My (very simple) understanding is, that Span<T> is a new implementation to iterate, sort or splice arrays with a lower memory footprint than traditional methods..

Your idea would be to store the resulting string as char[] made with Span<T> from the parsed markdown or to substring the parsed markdown with a Span<T> and store it as a String?

@windowstoolkitbot
Copy link

This PR seems inactive. @joejoe87577 Do you need help to complete this issue?

@joejoe87577
Copy link
Author

I'm just waiting for another review as this pr needs two reviews.

@nmetulev
Copy link
Contributor

nmetulev commented Feb 1, 2019

Sorry for the late response @joejoe87577. I just realized that Span is not available in .NET Standard 2.0 so we probably can't use it. However, I'm still wondering if there is a way to avoid having to keep a copy of the string in each element and instead generating it as needed. For example, what if we keep a reference as the range in the original string only and generate the string in ToString()?

@michael-hawker
Copy link
Member

We should also add some Unit Tests to our Markdown Unit Test project to ensure that the right results are returned as expected.

@windowstoolkitbot
Copy link

This PR seems inactive. @joejoe87577 Do you need help to complete this issue?

@michael-hawker michael-hawker added this to the 6.0 milestone Mar 5, 2019
Copy link
Contributor

@kbrons kbrons left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code works well, the only small issue is that the OriginalMarkdown property isn't filled on List children, but only on the list itself.

@michael-hawker michael-hawker modified the milestones: 6.0, 6.1 Oct 29, 2019
@michael-hawker
Copy link
Member

@Sergio0694 what are your thoughts on Span and perf here and how we could optimize this?

@Sergio0694
Copy link
Member

@michael-hawker I'll have a look at the specific package and this PR and get back to you!

Also @nmetulev about Span<T> not being available on .NET Standard 2.0, as of now they're already included in the base Microsoft.Toolkit package, as I've added a reference to System.Memory in #3131. So we can definitely use them in other dependent packages as well now.

In fact, I was just looking around the markdown parser in this package the other day, and was thinking that it could definitely use some Span<T> love - right now there are a lot of substrings being created that are not necessary - a simple no-allocation slice would suffice in most cases.

For instance, most inline parsers have a line like this:

https://github.com/windows-toolkit/WindowsCommunityToolkit/blob/8d4fb2d43d6b3cfe2fea7c05c487b93ece710861/Microsoft.Toolkit.Parsers/Markdown/Inlines/BoldItalicTextInline.cs#L58

Which could basically always be replaced by a Span<T>, reducing both the execution time as well as the amount of objects the GC has to track. Granted, those are all short lived objects ending up in generation-0, but still, it'd be a nice optimization I think.

@Sergio0694
Copy link
Member

Sergio0694 commented Mar 17, 2020

After taking a look at this PR I think we could just remove the overhead caused by allocations when setting the new OriginalMarkdown property by calling Substring, by exposing that property as a ReadOnlyMemory<char> (wrapping the area of interest) instead of a string (the substring). Using a Span<T> (or ReadOnlySpan<T>) won't work in this case, since those are ref struct types, so they can't be stored as fields in a class (that'd cause them to end up in the managed heap).

The idea would be, instead of using markdown.Substring(start, length), we'd just use markdown.AsMemory(start, length). That'd require no allocations, and users would be able to get the actual string from that if they need it, by just calling ReadOnlySpan<T>.ToString() on that property.

This is possible since the type has a special handling when the wrapped object is string, returning the actual substring the user expects, see here). If for some reason we'd instead want to expose that property as a string, we could still save the unnecessary allocations when users don't need them by storing the ReadOnlyMemory<char> instance internally and leveraging lazy loading like so:

/// <summary>
/// Gets or sets the <see cref="ReadOnlyMemory{T}"/> for <see cref="OriginalMarkdown"/>.
/// </summary>
internal ReadOnlyMemory<char> OriginalMarkdownMemory { get; set; }

private string _OriginalMarkdown;

/// <summary>
/// Gets the original Markdown the element was parsed from.
/// </summary>
public string OriginalMarkdown => _OriginalMarkdown ??= OriginalMarkdownMemory.ToString();

Note: the OriginalMarkdown getter is not thread safe in this case, but assuming that'd likely be used on the UI thread anyway it shouldn't matter, plus even in a race condition there wouldn't be crashes, the worst that could happen would just be the allocation of an unnecessary copy of the target substring, which would just be discarded. The user wouldn't notice any differences. This approach though keeps things simpler by not requiring additional synchronization primitives.

This is more of an API-design preference issue, the main point is to avoid allocations for users that don't actually need this new OriginalMarkdown property. There is technically still a minor memory overhead even with this approach, since each inline object would be 12 bytes bigger (16 bytes bigger with the lazy loaded string property) on x86 (16/24 bytes on x64), but assuming the input strings would generally be much larger, this should be negligible anyway.

@michael-hawker
Copy link
Member

@Sergio0694 thanks for all your input here! I know this kind of expands the scope of this original PR for this feature, but I believe if I understand correctly your comments, but if we more broadly apply these techniques to provide the OriginalMarkdown property as a core piece of each parser, that we can improve the performance of the parsers themselves as we wouldn't be copying the strings with the substring calls everywhere, correct? However, this seems like a larger task/feature, eh?

@michael-hawker michael-hawker modified the milestones: 6.1, 7.0 Apr 2, 2020
@michael-hawker
Copy link
Member

Moving out of 6.1 release as think refactoring this would be a lot of work and may be superseded by issue #3200 if we just end up adopting a more robust parser. Hopefully a library like Markdig supports this type of feature already? @MihaZupan

@MihaZupan
Copy link

Markdig exposes a SourceSpan on every AST element, which is just a Range (currently a user would have to call Substring themselves).
We could expose a helper method returning ReadOnlyMemory<char> for that AST element, leaving the need for a string allocation up to the user.

If this Toolkit exposes a wrapper over the AST, it is free to expose a lazily-allocated (as @Sergio0694 suggested) substring if the user requests it.

I would hold off on exposing more API until a potential migration has been prototyped.

@michael-hawker
Copy link
Member

@joejoe87577 thanks for taking an initial look at his. I think we should just roll this effort into #3200, I've added your original issue as a requirement there. Are you ok with us closing this PR?

@joejoe87577
Copy link
Author

@michael-hawker yeah you can close this PR. Thanks.

@Sergio0694 Sergio0694 closed this Jul 23, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants