comparison ppgen.py @ 7:8b2f8f439817
Improves: ding parser.
* Strips greater-than and less-than signs from the beginning and end of words
when reading a ding dictionary. Words enclosed in those characters seem
to be variants. This affects about 100 to 200 words for de in de-en 1.7.
author | Bernhard Reiter <bernhard@intevation.de> |
---|---|
date | Tue, 21 Feb 2017 14:14:08 +0100 |
parents | 81f75c9aac84 |
children | 200c2c3c5f67 |
comparison
6:81f75c9aac84 | 7:8b2f8f439817 |
---|---|
100 # languages are separated by " :: " | 100 # languages are separated by " :: " |
101 p = line.partition(" :: ") | 101 p = line.partition(" :: ") |
102 languageEntry = p[0] if useLeft else p[2] | 102 languageEntry = p[0] if useLeft else p[2] |
103 | 103 |
104 for word in splitter.split(languageEntry): | 104 for word in splitter.split(languageEntry): |
105 word = word.strip('(",.)\'!:;').rstrip('/') | 105 word = word.strip('(",.)\'!:;<>').rstrip('/') |
106 if len(word) > 2 and not word[0] in '[{/': | 106 if len(word) > 2 and not word[0] in '[{/': |
107 dset.add(word) | 107 dset.add(word) |
108 | 108 |
109 #TODO: check for very common words and remove them? | 109 #TODO: check for very common words and remove them? |
110 | 110 |
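
The effect of the change on line 105 can be illustrated with a minimal sketch. It assumes a whitespace-and-pipe `splitter` regex, since the real pattern is defined outside this hunk. The helper `extract_words` and the sample entry are hypothetical and only mirror the loop shown above.

```python
import re

# Assumption: ppgen's real `splitter` is defined outside the hunk shown above;
# this whitespace/pipe pattern is only a stand-in for illustration.
splitter = re.compile(r"[\s|]+")

OLD_STRIP = '(",.)\'!:;'      # characters stripped in revision 6
NEW_STRIP = OLD_STRIP + '<>'  # revision 7 additionally strips '<' and '>'


def extract_words(language_entry, strip_chars):
    """Hypothetical helper mirroring the loop in the diff."""
    words = set()
    for word in splitter.split(language_entry):
        # Remove surrounding punctuation, then any trailing slash.
        word = word.strip(strip_chars).rstrip('/')
        # Skip short tokens and annotations such as "{m}" or "[ugs.]".
        if len(word) > 2 and word[0] not in '[{/':
            words.add(word)
    return words


# Made-up sample entry, not taken from the de-en 1.7 data.
entry = "Abbau {m}; <Abbruch> (von etw.)"
old_words = extract_words(entry, OLD_STRIP)
new_words = extract_words(entry, NEW_STRIP)
```

With the old character set the variant is stored as `<Abbruch>`, angle brackets included. With the new set it is stored as `Abbruch`. Note that `str.strip` only removes characters from the two ends of a token, so angle brackets in the middle of a word are left alone.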