FAO: htmltidyutils



NeoMorph
October 20th, 2007, 03:25 AM
Hey Rob... Been using a couple of your functions but some of them I don't know what they do.

What I'm looking for is one that will strip out all the HTML tags so it becomes pure text. I've managed to do some scraping but the text is riddled with tags that I need to zap is all.

BTW if it wasn't for that Programming in Lua book explaining pattern matching I wouldn't have been able to understand it... The manual has already been heavily used and has made life SO much easier. I'm beginning to understand Lua more and I'm really beginning to like it now.

Edit: WOW... Manual came into its own again...


page=string.gsub(page,"<.->","")

That's all it needs to remove ALL the html tags in one go! That's really cool. I thought I'd have to do loads of replace expressions to get rid of all the tags but that one line does the lot in one go. :D
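
A slightly fuller sketch of the same idea, for anyone reusing it: the one-liner removes the tags themselves but leaves the contents of script and style blocks in the text, so this version drops those first. The helper name strip_tags and the sample strings are made up for illustration, and it assumes lowercase tag names and no stray ">" inside attribute values.

```lua
-- strip_tags is a hypothetical helper name, not part of Girder or htmltidyutils.
local function strip_tags(html)
  -- Drop <script> and <style> blocks first, contents included (lowercase tags only).
  local text = string.gsub(html, "<script.->.-</script>", "")
  text = string.gsub(text, "<style.->.-</style>", "")
  -- The pattern from the post: "<", then as few characters as possible, then ">".
  text = string.gsub(text, "<.->", "")
  return text
end

print(strip_tags("<p>Hello <b>world</b></p>"))                 --> Hello world
print(strip_tags("a<script type='x'>var b = 1;</script>c"))    --> ac
```

The "-" is what makes this work: it matches as few characters as possible, so each tag is removed on its own. With the greedy "<.*>" the match would run from the first "<" to the last ">" and take everything in between with it.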

Ron
October 20th, 2007, 08:39 AM
gotta love them Regular Expressions (RegExp)

blubberhoofd
October 20th, 2007, 12:35 PM
hi,

here's another useful one

page = string.gsub(page, '%s(%s*)', ' ')

this will collapse every run of whitespace (spaces, tabs, newlines) in your string into a single space.

hope this helps ;)
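
A small sketch building on the line above, assuming you also want the leading and trailing space trimmed after collapsing. The helper name normalize_space is made up for illustration; "%s+" matches the same runs as '%s(%s*)' without the unused capture.

```lua
-- normalize_space is a hypothetical helper name, not part of any library.
local function normalize_space(s)
  -- Collapse each run of whitespace (spaces, tabs, newlines) into one space.
  s = string.gsub(s, "%s+", " ")
  -- After collapsing, at most one space can remain at either end; trim it.
  s = string.gsub(s, "^ ", "")
  s = string.gsub(s, " $", "")
  return s
end

print(normalize_space("  Hello \n\t world  "))  --> Hello world
```

Note that string.gsub returns two values, the new string and the number of replacements. Assigning to a single variable, as here and in the original one-liner, keeps only the string.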

NeoMorph
October 20th, 2007, 05:23 PM
Lucky for me the Allmusic page has no extra spaces but I'm keeping that in my codesnippets file.

Thx again. I'd done no scraping before trying it here with Girder/Lua. It's pretty easy once you get your head around Regular Expressions, though for years I had avoided them because I just couldn't understand them. The Programming in Lua manual was the first text that made them easily understandable for me.