import codecs, re, urllib2

# Download the blog post
f = urllib2.urlopen('http://www.soemin.net/2009/04/font-encoding-detection-for-zawgyi-and.html')
# Decode the page as UTF-8 and convert numeric character references,
# e.g. &#4096; becomes က
htm = re.sub(r"&#(\d+);", lambda x: unichr(int(x.group(1))), f.read().decode("utf8"))
# Keep only the post body, up to the trailing "clear: both" div
txt = re.findall(r'<div[^>]+post-body[^>]+>\s*(.*?)\s*<div[^>]+clear:\s*both[^>]+></div>', htm, re.DOTALL)[0]
codecs.open("crawl.txt", 'w+', "utf8").write(txt)
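For readers on Python 3, where urllib2 and unichr are no longer available, the same two steps (decoding numeric character references, then cutting out the post body) could be sketched roughly as follows. The function names decode_numeric_refs and extract_post_body and the sample markup are illustrative assumptions, not part of the original script; the sketch works on a plain string, so it needs no network access.

```python
import re


def decode_numeric_refs(html):
    """Replace decimal references such as &#4096; with the character they encode."""
    return re.sub(r"&#(\d+);", lambda m: chr(int(m.group(1))), html)


def extract_post_body(html):
    """Return the text between the post-body div and the 'clear: both' div, or None."""
    match = re.search(
        r'<div[^>]+post-body[^>]+>\s*(.*?)\s*<div[^>]+clear:\s*both[^>]+></div>',
        html,
        re.DOTALL,
    )
    return match.group(1) if match else None


# Hypothetical sample markup standing in for the downloaded page
sample = (
    '<div class="post-body entry-content">\n'
    '&#4096;&#4097; text\n'
    '<div style="clear: both;"></div>'
)

body = extract_post_body(decode_numeric_refs(sample))
print(body)
```

Returning None when nothing matches avoids the IndexError that indexing findall(...)[0] raises on a page without a post body. In Python 3 the standard library's html.unescape can also be used for the decoding step, and it covers named and hexadecimal references as well.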
The results will look like this:
ေဇာ္ဂ်ီနဲ့ ယူနီကုတ္ ၅.၁ ခြဲျခားျခင္း (Font Encoding Detection for Zawgyi and Unicode 5.1)
.....
အဓိကအားျဖင့္ကေတာ့
၁။ သေဝထိုး၊ ရရစ္၊ ရပင္းစတာေတြ နဲ့
....
.....
Cheers,
Soe Min