regex content between multiple tags

tench

I have an XML segment that looks like this:

Code:
<s xml:id="X">Some text.</s><s xml:id="Y">Some other text.</s><s xml:id="Z">Some other text.</s>

I am new to regular expressions, but I would like to match the text between s tags.

I've tried with:
Code:
(<s.+>).+</s>
which is a good start, but it matches everything from the first s tag to the last one. Now, I realize that I need a negation somewhere (i.e. tell the regex engine what NOT to match) so that it does not allow all those other tags between the first and the last tag, but I haven't been able to come up with anything that works.

Any tips would be greatly appreciated.

All best,
Tench
 
This should do it:
(<s[^>]*>)([^<]*)</s>

That replaces your "." with "any character except >" (or "any character except <" in the second instance). I also changed the + to * so it will match empty content as well, e.g., <s>some text</s> or <s></s>

The first set of parentheses will match <s xml:id="X">, the second will match Some text., and the entire expression will match <s xml:id="X">Some text.</s>
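In case it helps to see the groups in action, here is a minimal sketch in Python. Python's re module is just my assumption for illustration (you didn't say which tool you are using), and the variable names are made up.

```python
import re

# The sample segment from the original question.
sample = ('<s xml:id="X">Some text.</s>'
          '<s xml:id="Y">Some other text.</s>'
          '<s xml:id="Z">Some other text.</s>')

# Group 1: the opening tag. Group 2: the text between the tags.
pattern = re.compile(r'(<s[^>]*>)([^<]*)</s>')

# Collect (whole match, opening tag, text) for every <s>...</s> pair.
matches = [(m.group(0), m.group(1), m.group(2))
           for m in pattern.finditer(sample)]

for whole, tag, text in matches:
    print(tag, '|', text)
```

Each match covers exactly one <s>...</s> pair, so you get three separate results instead of one long one.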

One problem with this is that it won't work if you have tags embedded within your <s> tag. For example, it wouldn't properly match this:
<s xml:id="X">Some <b>BOLD</b> text.</s>
Because the <b> tag would trip it up: [^<]* stops at the < of <b>, and the expression then expects </s> at that spot. I really don't know how to account for that in a regular expression.


Another way to do this is to set it not to be a "greedy" search, so it will match the shortest string it can rather than the longest. Then your original expression would work just fine. But I'm not sure if what you're using allows you to specify that.
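For what it's worth, here is a rough sketch of the non-greedy idea, assuming a Perl-style engine where a ? after a quantifier makes it lazy. Python's re module works that way; whether your tool does is something you would have to check.

```python
import re

# The sample segment from the original question.
sample = ('<s xml:id="X">Some text.</s>'
          '<s xml:id="Y">Some other text.</s>'
          '<s xml:id="Z">Some other text.</s>')

# The original expression: greedy .+ runs to the last </s>.
greedy = [m.group(0) for m in re.finditer(r'(<s.+>).+</s>', sample)]

# Same expression with lazy quantifiers (.+?): each match stops
# at the first place where the rest of the pattern can succeed.
lazy = [m.group(0) for m in re.finditer(r'(<s.+?>).+?</s>', sample)]

print(len(greedy))  # one match spanning the whole string
print(len(lazy))    # one match per <s>...</s> pair
```

The only change from the original expression is the ? after each +.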
 
mikuro, you rock my world!
thank you so much for your explanation. you make regex sound beautiful! :)
all best,
tench
 
The only problem with that one is that it will match any opening tag whose name starts with an s, like <section> or <sally>.

(<s(>| [^>]*>))([^<]*)</s>

will force it to only match <s> tags, but still won't handle nested tags.
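A quick sketch of the difference, assuming Python's re module; the test string is contrived (it is not even well-formed XML) and is only there to trigger the false match. Note that the extra parentheses shift the numbering, so the text now lands in group 3.

```python
import re

# Looser version: the tag name only has to START with "s".
loose = re.compile(r'(<s[^>]*>)([^<]*)</s>')

# Stricter version: "<s" must be followed by ">" or by a space.
strict = re.compile(r'(<s(>| [^>]*>))([^<]*)</s>')

# Contrived input: a <section> tag followed by two real <s> tags.
tricky = '<section>oops</s><s xml:id="X">Some text.</s><s>bare</s>'

loose_texts = [m.group(2) for m in loose.finditer(tricky)]
strict_texts = [m.group(3) for m in strict.finditer(tricky)]  # group 3 now

print(loose_texts)
print(strict_texts)
```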

(<s(>| [^>]*>))(.*)</s>

will grab everything inside the outermost <s> tag.

.* is greedy, but it won't eat the last </s> that occurs because that would prevent the entire match.
This one would have a problem with your sample string though, grabbing everything from the first opening to the last closing tag.
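Here is a small sketch of both behaviours of the greedy version, again assuming Python's re module. The nested string is a made-up example, not something from the original post.

```python
import re

greedy = re.compile(r'(<s(>| [^>]*>))(.*)</s>')

# Hypothetical input with one <s> nested inside another.
nested = '<s xml:id="A">outer <s xml:id="B">inner</s> tail</s>'

# The sample segment from the original question.
sample = ('<s xml:id="X">Some text.</s>'
          '<s xml:id="Y">Some other text.</s>'
          '<s xml:id="Z">Some other text.</s>')

# Nested case: .* backs off only as far as the LAST </s>,
# so group 3 holds everything inside the outermost tag.
nested_inner = greedy.search(nested).group(3)

# Sibling case: the same backtracking swallows the middle tags too.
sample_inner = greedy.search(sample).group(3)

print(nested_inner)
print(sample_inner)
```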

(<s(>| [^>]*>))(.*?)</s>

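will stop at the first </s> it finds, so it handles your sample string (and other tags like <b> inside an <s>), but it cuts off too early when <s> tags are nested inside each other.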
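A sketch of the lazy version on three inputs, assuming Python's re module. The nested string is invented for illustration. Also keep in mind that in Python "." does not match a newline unless you pass re.DOTALL, so this assumes each <s>...</s> sits on one line.

```python
import re

lazy = re.compile(r'(<s(>| [^>]*>))(.*?)</s>')

# The sample segment from the original question.
sample = ('<s xml:id="X">Some text.</s>'
          '<s xml:id="Y">Some other text.</s>'
          '<s xml:id="Z">Some other text.</s>')

# The embedded-tag example from earlier in the thread.
with_bold = '<s xml:id="X">Some <b>BOLD</b> text.</s>'

# Hypothetical input with one <s> nested inside another.
nested = '<s xml:id="A">outer <s xml:id="B">inner</s> tail</s>'

sample_texts = [m.group(3) for m in lazy.finditer(sample)]
bold_texts = [m.group(3) for m in lazy.finditer(with_bold)]
nested_texts = [m.group(3) for m in lazy.finditer(nested)]

print(sample_texts)   # one entry per sibling tag
print(bold_texts)     # <b>...</b> is kept inside the text
print(nested_texts)   # cut off at the inner </s>
```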

I don't think it's possible to handle both strings. Someone else, please help? :)
 