# Question about "awk"



## simX (Jul 11, 2002)

How do I get awk to split a string after/before a certain character?

For example, if I have the string "9+(8+9+(6+4))", I want awk to split it into "9+", "(8+9+", "(6+4)", ")".... that is, I want awk to split the string BEFORE an open parenthesis, and AFTER a close parenthesis.

How would I accomplish that?  Of course, I want it to output to standard out (Terminal window), so that I can use this command in an AppleScript.

Another question: if I have a string like "3982ldlsieK3829", how would I have awk split the string after the first string of numbers?  I don't want it to split after each number, but after each string of numbers.  I also don't want it to split up the letters.  The result from the above string should be "3982","ldlsieK","3829" (on different lines, of course).

Thanks for any help!


----------



## howardm4 (Jul 11, 2002)

in perl (a superior language IMO) it'd be
something like (I'm sure this could be 
optimized more but this is just off the
top of my head):

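$_ = '9+(8+9+(6+4))';
@list = split(//, $_);   # one character per element
$string = '';

foreach (@list) {
  if (/\(/) {
    # an open paren starts a new piece: print what we have so far
    print "$string\n" if length $string;
    $string = '(';
  }
  elsif (/\)/) {
    # a close paren ends the current piece
    print "$string)\n";
    $string = '';
  }
  else {
    $string .= $_;
  }
}
print "$string\n" if length $string;   # anything left after the last paren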


you could do something like:
 @list = split(/[\(\)]/, $_);

  foreach (@list) {
    print "$_\n";
  }

but, as usual, split tends to eat the
character(s) that are the split basis so
that wouldn't do what you want.
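
That said, split only eats what the pattern actually matches, so a zero-width pattern leaves every character in place. A minimal sketch, assuming lookahead and lookbehind are acceptable here; the helper name split_at_parens is made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: break a string before each '(' and after each ')'.
sub split_at_parens {
    my ($text) = @_;
    # (?=\()  matches the empty spot just before an open paren;
    # (?<=\)) matches the empty spot just after a close paren.
    # Neither consumes a character, so nothing is lost.
    return split /(?=\()|(?<=\))/, $text;
}

print "$_\n" for split_at_parens('9+(8+9+(6+4))');
```

For the sample string this should print the four pieces asked for in the first post, one per line.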

If the object you're trying to split up
is always of the same structure but just
the numbers are different, then it becomes
a much different problem.

the last one would be

$oddstring = '1234abcD6789';
($a, $b, $c) = ($oddstring =~ /(\d+)([a-z]+)(\d+)/i);

(this lame software ate all my indentation!)
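
The three-capture regex above assumes exactly digits, letters, digits. If the number of runs is not known in advance, a /g match in list context can collect every run instead. A sketch only; split_runs is a hypothetical name:

```perl
use strict;
use warnings;

# Hypothetical helper: return each run of digits or run of letters, in order.
sub split_runs {
    my ($text) = @_;
    # With /g and no capture groups, list context returns every whole match.
    return $text =~ /\d+|[A-Za-z]+/g;
}

print "$_\n" for split_runs('3982ldlsieK3829');
```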


----------



## simX (Jul 11, 2002)

Haha... funny that you mention perl, because this WAS eventually going to be a CGI script.. I just wanted to understand how to do it first with UNIX, so that I could formulate how to do it in perl.

Basically what I'm trying to do is read in a chemical formula (something like "Mn(H2S)3H2SO4(SO)4"; yes, I know that's a bogus formula, but that's the kind of structure), and then have it output the number of atoms of each element in the formula.

So I'm just trying to think about how I would go about doing this.  Split the string so that you have things in parentheses by themselves... i.e.: "Mn","(H2S)","3H2SO4","(SO)","4".  And then I wanted to split something like "3H2SO4" into "3" and "H2SO4", but I didn't want to limit it to a single number at the beginning.

Wow, such a hard problem... heh, my brain's going to explode.  Time for lunch!  Yum! 

Thanks for the help, howardm4.  This should get me started. 

By the way, do you think you could explain how those perl scripts you wrote actually work?  That would be helpful too, as I'm not a perl expert.


----------



## howardm4 (Jul 11, 2002)

jeez, I'm such a moron sometimes.

$_ = 'Mn(h2s)3h2so4(so)4';
(@list) = ($_ =~ /(\(*\w+\)*)/g);

foreach (@list) {
# if the element starts w/ a number
# then effectively split it by putting
# it on a new line.  Otherwise just print.
# You'll need to adjust this to taste
  if (/^(\d+)(.+)/ {
    print "$1\n$2\n";
  }
  else {
    print "$_\n";
  }
}

will do what you want now that you've
made it clear.


First off, your statement 'I'd rather do
it in Unix' doesn't make sense.  It's all
Unix.  You're using a specific tool that runs
under Unix.  The right tool for the right 
job.  You're not 'cheating' by using Perl
instead of AWK.

All that line of code does is set up a 
regular expression that defines what
we're looking for which is some amount
of alphanumeric characters optionally
surrounded by paren's.  The /g says
'do it repeatedly (globally)'.  The rest
of it simply says 'load the members of 
the array 'list' with the output of the 
regular expression match'.

So, at the end of it, @list (the array) has
what you want in each element (ie. 'Mn'
is first array element $list[0]).


----------



## simX (Jul 11, 2002)

It gives me two errors (I'll try to fix it on my own, but I dunno how I'll do):



> syntax error at ./chemform.pl line 11, near "/^(\d+)(.+)/ {"
> syntax error at ./chemform.pl line 14, near "else"
> Execution of ./chemform.pl aborted due to compilation errors.



UPDATE:  Nevermind, I got it.  You forgot a parenthesis on line 9; you need to close those parentheses. 

By the way, a couple more questions:

1) How do I get this script to accept input, not have a set formula put in?

2) What's a good online perl reference?

3) What exactly is the format of the regular expression you set up?  I'd like to know so I can expand the script later..

4) I see how you save values to variables, but how do you save them to arrays?  I'd like to get all of the stuff from the above script into a list; right now they're just being printed onscreen.


----------



## howardm4 (Jul 11, 2002)

> _Originally posted by simX_
> It gives me two errors (I'll try to fix it on my own, but I dunno how I'll do):
> 
> 
> ...




You can say something like:

chomp($_ = $ARGV[0]);

which would take 1 command line argument
while removing the implied newline.
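
A slightly fuller sketch of the same idea, wrapped in a sub. The name formula_from_args and the fallback to the sample formula are assumptions for illustration, not part of the original script:

```perl
use strict;
use warnings;

# Hypothetical helper: use the first argument as the formula, or fall back
# to the sample formula from this thread when none is given.
sub formula_from_args {
    my @args = @_;
    my $formula = @args ? $args[0] : 'Mn(h2s)3h2so4(so)4';
    chomp $formula;    # harmless: shell arguments normally carry no newline
    return $formula;
}

my $formula = formula_from_args(@ARGV);
my @list = $formula =~ /(\(*\w+\)*)/g;
print "$_\n" for @list;
```

When calling it from the shell, the formula needs quoting, since parentheses are special there: ./chemform.pl 'Mn(H2S)3H2SO4(SO)4'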

The man page for perl is quite complete, or
try perl.com or perl.org, or buy the
'Programming Perl' book (really!)

(@list) = ($_ =~ /(\(*\w+\)*)/g); 

I can't really do a full tutorial on regex's
here.

(@list) is an lvalue in list context

$_  is the default variable space

=~ says 'do a regular expression operation'

/ start of regex definition

( start of grouping operator

\(* escaped '('  (since they are a grouping operator by default) taken 0 or more times

\w+ word character (letter, digit, or underscore) 1 or more times

\)* again, escaped ')' 0 or more times

) end of grouping operator

/ end of regex definition

g - says 'do it globally' (repeatedly)

The grouping operator means if the regex
matches, then the result will be saved and  put into
one of the @list elements.
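
The breakdown above can also be written into the regex itself with the /x modifier, which lets whitespace and comments sit inside the pattern. A sketch using the same pattern; the variable name $piece is just for illustration:

```perl
use strict;
use warnings;

my $piece = qr{
    (        # start of capture group
      \(*    # zero or more literal open parens
      \w+    # one or more word characters
      \)*    # zero or more literal close parens
    )        # end of capture group
}x;

# /g in list context returns the captured text of every match.
my @list = 'Mn(h2s)3h2so4(so)4' =~ /$piece/g;
print "$_\n" for @list;
```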

Regular expressions are one of the very
most important things in Perl in order to
really use it well.

As you can see, when I didn't understand
what you were trying to accomplish, it
was  a messy bit of bad programming.
By understanding the lexical structure
of the strings you were working w/, it 
becomes a one line solution.

One of the other very important things
is that the language understands its
context as to whether you are looking 
for a scalar answer or an array/list
type answer and it'll change its behavior
to match.
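
A small sketch of what context means in practice, using the same regex. The variable names are made up for the example:

```perl
use strict;
use warnings;

my $formula = 'Mn(h2s)3h2so4(so)4';

# List context with /g: every captured piece is returned.
my @pieces = $formula =~ /(\(*\w+\)*)/g;

# List context without /g: only the captures of the first match.
my ($first) = $formula =~ /(\(*\w+\)*)/;

# Scalar context: just true or false, did it match at all?
my $matched = $formula =~ /(\(*\w+\)*)/;

# An array in scalar context gives its number of elements.
my $count = @pieces;

print "first=$first count=$count\n";
```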

All of your chemical sub-strings ARE in
fact being saved to an array already.

Print out $list[0] and $list[1] if 
you don't believe me. 

print "$list[0]\n";

The '@list' refers to the ENTIRE array while
in order for you to get element X, you
specify '$list[X]'  (and dont confuse that
w/ array slice @list[X..Y]  )
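
To make the difference concrete, here is a short sketch with the pieces typed in by hand (so it runs without the regex):

```perl
use strict;
use warnings;

my @list = ('Mn', '(h2s)', '3h2so4', '(so)', '4');

my $first  = $list[0];        # one element: note the $ sigil
my $last   = $list[-1];       # negative indexes count from the end
my @middle = @list[1..2];     # a slice: note the @ sigil
my $size   = scalar(@list);   # how many elements in the whole array

print "$first\n";
print "@middle\n";            # a list in a string is joined with spaces
print "$last\n";
print "$size\n";
```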

Since we dont know how many sub-strings
will be extracted from the formula, we have
to do things in the most general way.
That is why I did:

(@list) = (..................)

and 

foreach (@list) {
   print $_
}

(inside the foreach loop, the variable
$_ is assigned to each array value in turn.)
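
One detail worth knowing: the loop variable is an alias for the array element, not a copy, so changing it changes the array. A sketch, with a named loop variable ($piece, chosen for illustration) instead of $_:

```perl
use strict;
use warnings;

my @list = ('Mn', '(h2s)', '3h2so4');

foreach my $piece (@list) {
    $piece = uc $piece;    # modifies the element of @list itself
}

print "@list\n";
```

If the original values need to survive, copy the element into a separate variable inside the loop before changing it.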


----------



## simX (Jul 12, 2002)

> _Originally posted by howardm4_
> All of your chemical sub-strings ARE in
> fact being saved to an array already.
> 
> ...



When I print out the array, it's showing the strings "Mn","(H2S)","3H2SO4","(SO)","4".  It's saving things from the first regular expression thing, but it's not saving anything from the foreach loop, because it seems you are just printing things to the screen.  If it WAS storing them from the foreach loop, you should see a separate "3" and "H2SO4", not "3H2SO4", right?



> Since we dont know how many sub-strings
> will be extracted from the formula, we have
> to do things in the most general way.
> That is why I did:
> ...



Got it.


----------



## howardm4 (Jul 12, 2002)

yes, in the foreach loop, I'm just printing
out the numeric, not saving it.

I gotta leave something for you to do 

of course, you could set up a hash
w/ the molecule as the key and the numeric
as the value, such as:

(pretend we're in the foreach now)

$hash{$_} = $1;



but you are correct, if we were keeping the
'3', then you'd see that as a unique element
in the @list array.
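
One possible way to keep the pieces instead of printing them is to push them onto a second array inside the loop. This is only a sketch; split_leading_counts and @saved are hypothetical names:

```perl
use strict;
use warnings;

# Hypothetical helper: split a leading number off each piece and return
# the flattened list, so '3h2so4' becomes '3', 'h2so4'.
sub split_leading_counts {
    my @out;
    for my $piece (@_) {
        if ($piece =~ /^(\d+)(.+)/) {
            push @out, $1, $2;     # number first, then the rest
        }
        else {
            push @out, $piece;     # nothing to split off
        }
    }
    return @out;
}

my @list  = 'Mn(h2s)3h2so4(so)4' =~ /(\(*\w+\)*)/g;
my @saved = split_leading_counts(@list);
print "$_\n" for @saved;
```

A piece that is only a number, like the trailing '4', is left alone because the pattern requires something after the digits.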


----------



## simX (Jul 16, 2002)

howardm4:  I'm gonna poke your brain again.

I am reading through O'Reilly's "Learning Perl" book (very good, btw... now I actually UNDERSTAND perl and its regular expressions).

I'm trying to modify the perl prog this time so that it matches anything within the parentheses and then INCLUDES the number after it.

Here's the prog:



> #!/usr/bin/perl -w
> print "Enter a chemical formula: ";
> chomp ($ooga = <STDIN>);
> (@list) = ($ooga =~ /(\(*\w+?\)*\d*)(?=(A-Z]|$))/g);
> ...



The problem is, when I put in something like "SHeSHe", I get back this output:



> S
> H
> He
> S
> ...



Where did all that extra stuff come from?!! BTW, I know I didn't account for parentheses in the lookahead portion of the regular expression (BTW, these regular expressions are f***ing powerful!).

A little help?


----------



## howardm4 (Jul 16, 2002)

explain why 

(@list) = ($ooga =~ /(\(*\w+\)*\d*)/g);

doesn't accomplish your goal.

I'm not sure what you're trying to
accomplish w/ that thing you came up w/.
You're trying to do some non-greedy 
matching and some other stuff (I dont
have my regex quickref w/ me) 
such as character classing and 
alternation w/ end of line (although
you dont seem to have the opening '['
for the character class, which would need
to be there, AND you'd need 'i' w/ the '/g'
if you want case insensitive 
matching), but you're
making it more complex than need be IMO.
And regex's can get ugly QUICKLY.

What do YOU think that
(?=(A-Z]|$)
is doing for you?

Using my regex:

[wm195:/tmp] howardm% perl -w junk2
Enter a chemical formula: Mn(h2s)3h2so4(so)4

Mn
(h2s)3
h2so4
(so)4
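
On the extra items in the SHeSHe output, one likely explanation, assuming the missing '[' is restored in the lookahead: a /g match in list context returns every capture group for every match, and the parentheses inside (?=...) are a second capture group, so whatever the lookahead saw comes back too. A sketch comparing that with a non-capturing (?:...) group:

```perl
use strict;
use warnings;

my $formula = 'SHeSHe';

# Capturing parens inside the lookahead: two items come back per match,
# the piece itself and the character the lookahead saw.
my @extra = $formula =~ /(\(*\w+?\)*\d*)(?=([A-Z]|$))/g;

# Non-capturing (?:...) inside the lookahead: one item per match.
my @clean = $formula =~ /(\(*\w+?\)*\d*)(?=(?:[A-Z]|$))/g;

print join(',', @extra), "\n";
print join(',', @clean), "\n";
```

The second list should contain just S, He, S, He. Note that this version keeps the match case sensitive on purpose, so that a capital letter marks the start of the next element.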


----------

