Stripping html code out of a source file

Darkshadow

wandering shadow
Is there an easy way to do this? I'm writing a small program for myself, and one of the steps requires me to strip the html code out of a source page.

I do need, however, to keep the alt part from the image tags - just the text that would be displayed if the images weren't displayed.
 
Just to clear this up a bit, what language are you using to write your small program, or haven't you decided yet?

You should be able to remove anything enclosed in < and > symbols, which will leave you with only the text. You will also need a way to pull the text between ALT=" and the closing " out of each image tag and into the final output.

How you do this depends on the language you are using.
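
As a rough illustration of that approach, here is a minimal sketch in Python (my choice for the example; nobody in the thread has picked it). It uses the standard library's html.parser rather than hand-matching < and >, drops every tag, and keeps the alt text of image tags. The names TextWithAlt and strip_html are made up for this example, and skipping the contents of script and style blocks is an assumption about what counts as "text".

```python
from html.parser import HTMLParser


class TextWithAlt(HTMLParser):
    """Collect the text of a page, plus the alt text of <img> tags."""

    SKIP = {"script", "style"}  # assumption: their contents are not wanted

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag == "img":
            alt = dict(attrs).get("alt")
            if alt:
                # Pad with spaces so the alt text does not fuse with neighbours.
                self.parts.append(" " + alt + " ")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def strip_html(html):
    """Return the text of an HTML string with all markup removed."""
    parser = TextWithAlt()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace into single spaces.
    return " ".join("".join(parser.parts).split())


if __name__ == "__main__":
    print(strip_html('<p>Hello <img src="a.png" alt="logo"> world</p>'))
```

Note that this flattens the page into a single line of text; if you need to keep line breaks, you would have to treat block-level tags such as p and br specially.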
 
Any language - whichever is the easiest. :D

Mostly I use shell scripts, but that only gives me sed - which is a great stripping program, but really bites at stripping html.

I know some PHP and Perl, so if you give me the relevant commands, I'm sure I can adapt them.
 
Perl would be the easiest for this sort of task; text processing is what it does best.
 
There are "text only" Web browsers (Lynx?) out there.

It seems to me that you can simply use one of them to get the content, including the ALT text.

Also, can't you simply do an IE Save As (Format = Plain Text) to get this?
 
Err...probably. I haven't used IE in a while. But no, that's not really an option - I'm doing this as part of a program, so I wouldn't be able to do it that way.
 
BBEdit has both AppleScript and command-line interfaces... It also has a command to "Remove Markup"... I'm not sure if you can script that though...
 
Take a look at this program. It's a Mac program that will convert your HTML files into text files.

http://www.printerport.com/klephacks/markdown.html
 
You could try links (or probably lynx too; I use links).

I know that "links -dump file://somefile.html" will get you formatted plaintext, including the alt text.
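
If the stripping has to happen from inside a program rather than at a prompt, the same command can be run as a subprocess. This is only a sketch in Python, assuming links is installed and on the PATH; the helper names links_dump_command and dump_text are invented for the example, and it passes an absolute file:// URL built from the path rather than the bare form quoted above.

```python
import shutil
import subprocess
from pathlib import Path


def links_dump_command(html_path):
    """Build the `links -dump` command line for a local HTML file."""
    # as_uri() needs an absolute path, so resolve it first.
    url = Path(html_path).resolve().as_uri()
    return ["links", "-dump", url]


def dump_text(html_path):
    """Run links and return its plain-text rendering of the file."""
    if shutil.which("links") is None:
        raise RuntimeError("links is not installed or not on PATH")
    result = subprocess.run(
        links_dump_command(html_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if links exits non-zero
    )
    return result.stdout


if __name__ == "__main__":
    print(links_dump_command("somefile.html"))
```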
 