While going over code today, I realized that I can shrink some data files quite a bit by removing redundancy. When I defined occurrence[i] as
[theLemma, theWordAsItAppearsHere, theNextLocationOfThatLemma]

I was forgetting that the lemma for a given word is already stored in gramInfo, so I can simply leave out the first column of that table. Yay! In fact, I am editing the code to keep just two data structures: the wordlist (a list of the words as they occur) and the corresponding nextLocation list (a list of numbers). That will make them more compressible, which might help performance if we store them as zip files.
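Here's a minimal sketch of the slimmer layout. The names (gram_info, build_next_location) and the backward-pass construction are my assumptions for illustration, not the actual code; the point is just that the lemma column can be recomputed from gramInfo, so only the wordlist and the nextLocation chain need to be stored.

```python
# Hypothetical sketch: the old per-occurrence table
#   [theLemma, theWordAsItAppearsHere, theNextLocationOfThatLemma]
# becomes two parallel lists. The lemma column is dropped because
# gram_info (standing in for gramInfo) already maps each surface
# word to its lemma.

gram_info = {"the": "THE", "quick": "QUICK", "brown": "BROWN",
             "fox": "FOX", "jumps": "JUMP", "over": "OVER",
             "lazy": "LAZY", "dog": "DOG"}

# The wordlist: words exactly as they occur in the text.
words = ["The", "quick", "brown", "fox", "jumps", "over",
         "the", "lazy", "dog"]

def build_next_location(words, gram_info):
    """next_location[i] = index of the next occurrence of the same
    lemma, or -1 if there is none. Built in one backward pass."""
    next_location = [-1] * len(words)
    last_seen = {}  # lemma -> nearest later index
    for i in range(len(words) - 1, -1, -1):
        lemma = gram_info.get(words[i].lower(), words[i].upper())
        if lemma in last_seen:
            next_location[i] = last_seen[lemma]
        last_seen[lemma] = i
    return next_location

print(build_next_location(words, gram_info))
# "The" at index 0 chains to "the" at index 6; everything else is -1
```

Since both lists are flat and highly repetitive, they should deflate well inside a zip archive.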
I also realized there's a small problem (or maybe a big one, but in any case one I can't do much about): search-by-lemma is limited to the lemmas of words that actually occur in the data. Otherwise they won't be in gramInfo, and getLemma will just say, for instance, if we look for "jumped over" in the sample "The,quick,brown,fox,jumps,over...", that
getLemma("jumped")="JUMPED"There's no way that it can tell that the answer should be JUMP, because it isn't running the lemmatizer; it's only looking up precomputed answers in gramInfo. So it won't find anything, although it should find that "jumps, over" is a match at location 4. To fix that, we would have to install the lemmatizer as a web service and invoke it via HTTP for each query.