Parsing a Dictionary
I'm waist-deep in work for my diploma project, which involves parsing words in the Romanian language (I'm working on parts of a Romanian text-to-speech system). Right now I'm trying to figure out what parts of speech a given word can possibly be (noun, verb, etc.). I'm using an amazing resource called dexonline, the Romanian online dictionary. Their database has lots of interesting information, including inflection data and all the inflected forms of every word (this turns out to be very useful). Unfortunately, only words with multiple inflected forms have easily accessible information about the part of speech; for the others I need to parse the word's textual definition. It looks like this:
Thank goodness for regular expressions. All that time spent figuring out these magic incantations from The Camel Book paid off. It took me like 10 minutes to come up with this little gem:
^@[^@]+@.*?(?P<type_list>(?P<type>(?P<words>\s+(\w+\.)+)+)(,(?P<type2>(?P<words2>\s+(\w+\.)+)+))*)
This gives me exactly what I want: "pron. nehot., adj. nehot., adv." - a string of abbreviations that describe the word. Also, thank goodness for linguists being pedantic about their dictionary syntax, otherwise this would not be automatically parsable (I haven't run the code through the whole database yet, but I hope I don't get any nasty surprises).
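For the curious, here is a minimal sketch of how the pattern could be applied, assuming Python's re module (the (?P<name>...) group syntax works there). The sample definition string and the helper names extract_type_list and split_types are made up for illustration; real dexonline definitions may carry extra markup that this does not handle.

```python
import re

# The pattern from above, split across two lines only for readability.
TYPE_LIST_RE = re.compile(
    r"^@[^@]+@.*?(?P<type_list>(?P<type>(?P<words>\s+(\w+\.)+)+)"
    r"(,(?P<type2>(?P<words2>\s+(\w+\.)+)+))*)"
)


def extract_type_list(definition):
    """Return the abbreviation list after the @HEADWORD@, or None if absent."""
    match = TYPE_LIST_RE.search(definition)
    if match is None:
        return None
    # The group starts with the whitespace before the first abbreviation.
    return match.group("type_list").strip()


def split_types(type_list):
    """Split 'pron. nehot., adj. nehot., adv.' into one entry per type."""
    return [part.strip() for part in type_list.split(",")]


# Hypothetical definition text, shaped to fit the pattern (not real data).
sample = "@CEVA@ pron. nehot., adj. nehot., adv. Un lucru oarecare"

type_list = extract_type_list(sample)
print(type_list)
if type_list is not None:
    print(split_types(type_list))
```

One thing to watch for: the pattern treats any run of "word." tokens as abbreviations, so a one-word sentence right after the list (say, "adv. Ceva.") would get swallowed into the match. That is the kind of nasty surprise a full run over the database might turn up.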
Many thanks to the great people who spent countless hours building the dexonline.ro database, only to give it away under the GPL. You guys are amazing.
