I think my main role here is Maker Of Annoying Suggestions, at least for the moment, and for the moment I'm going to suggest that the Works of Lenin (which you were saying, if I recall correctly, which usually I don't, would be 50-some plain-text roughly-a-megabyte volumes, each with a few Works, or was it 50-some Works?) can go in a folder, one file per volume, along with a set of JSON files, one per Work, which are loaded at need by the interactive part of the project.
(update: I'm assuming standard Python JSON; there are SAX-like streaming libraries for dealing with multi-gigabyte files, but looking at How Big is TOO BIG for JSON? suggests to me that we probably needn't worry about this.)
The top-level JSON file, top.json, contains one list of dicts, each with metadata about one Work, including which files store its info. That lets you focus on one or two or ten Works at a time, based on date-ranges or explicit inclusion or previous searches or whatever. Think of it as roughly
[{name:"Work1", source_file:"volume1.txt", dataFile:"Work1.json", start:"1896", end:"1898"},
 {name:"Work2", source_file:"volume1.txt", dataFile:"Work2.json", ... },
 .... ]

Work1.json, like each of its siblings, will contain a list of lists, a dict of ints and a dict of dicts. Let me go back to imagining "The quick brown fox jumps over the lazy dog." as Work1. Okay, we have

{occurrence: [["the","The",6], ["quick",null,0], ["brown",null,0], ["fox",null,0], ["jump","jumps",0], ....],
 firstOccur: {"brown":2, "dog":8, "fox":3, "jump":4, "lazy":7, "over":5, "quick":1, "the":0},
 gramInfo: {"the":{pos:"article",...}, "jump":{pos:"verb",...}, ... }}

The occurrence table has one row for every occurrence of a word in the Work, even a trivial word. It has at least three columns: the first for the normal form of the word, which we use for lookup and search; the second for the word's actual appearance in this particular occurrence; and the third for the location of the next occurrence of that normal form, if any.
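That table can be built in a single pass over the text. A minimal Python sketch — the lemma table here is a stand-in (just enough to send "jumps" to "jump"); real morphological normalization would replace it:

```python
import string

# Stand-in lemma table, just enough for the toy sentence; real
# morphological normalization would replace this lookup.
LEMMAS = {"jumps": "jump"}

def normalize(token):
    low = token.lower()
    return LEMMAS.get(low, low)

def build_index(text):
    occurrence = []  # rows of [normal form, form as it appears (or None), next location (0 = none)]
    firstOccur = {}  # normal form -> location of its first occurrence
    last = {}        # normal form -> most recent location, used to fill in the "next" column
    tokens = (t.strip(string.punctuation) for t in text.split())
    for loc, tok in enumerate(t for t in tokens if t):
        norm = normalize(tok)
        occurrence.append([norm, tok if tok != norm else None, 0])
        if norm in last:
            occurrence[last[norm]][2] = loc  # link the previous occurrence forward
        else:
            firstOccur[norm] = loc
        last[norm] = loc
    return occurrence, firstOccur

occurrence, firstOccur = build_index("The quick brown fox jumps over the lazy dog.")
# occurrence[0] is ["the", "The", 6]; occurrence[4] is ["jump", "jumps", 0]
```

Running it on the fox sentence reproduces exactly the occurrence and firstOccur values in the example above.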
"the" appears at location 0 in the form "The", and at location 6 in its normal form; "jump" appears at location 4 in the form "jumps"; everything else appears in its normal form, and only once. Now if we look for the word "the" within three words of the word "dog", we can do it in a reasonable time.
Here I'm no longer dropping out the trivial words (I am dropping out punctuation, which I suppose should get a fourth column in the occurrence table.) It's possible that this should all go in a database; I admit I was thinking that... but it's not big enough to make the overhead worthwhile, and grammatical info doesn't go well into database structures, because each row ends up having different properties for which you need to record values. In fact, until I thought about the "words within N words of each other" problem just this morning, I thought of this as mainly a KWIC index: basically a list of 5-tuples, with one tuple for each non-trivial word. So if there are 200,000 words in a volume, and 50,000 are occurrences of maybe 20K distinct non-trivial words, that's a 50K * 5 list, which could be 50K database records or just an array loaded into memory; it's small. That's made somewhat more complex by the preservation of context (trivial words + punctuation marks), and then made considerably more complex by the grammatical info attached to each word, which could be stored separately as 20K rather poorly-structured records, maybe a dict of 20K small dicts... The grammatical info is used in search, so it has to be pre-computed and stored.
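For a taste of how the pre-computed grammatical info gets used in search, a small sketch — the gramInfo entries here are invented placeholders shaped like the {pos: ...} dicts in the example above:

```python
# Toy data shaped like the Work1 example; the gramInfo values are placeholders.
occurrence = [["the", "The", 6], ["quick", None, 0], ["brown", None, 0], ["fox", None, 0],
              ["jump", "jumps", 0], ["over", None, 0], ["the", None, 0],
              ["lazy", None, 0], ["dog", None, 0]]
gramInfo = {"the": {"pos": "article"}, "jump": {"pos": "verb"}, "dog": {"pos": "noun"}}

def locations_with_pos(pos):
    """Every location whose word's normal form carries the given part of speech."""
    return [loc for loc, (norm, _display, _next) in enumerate(occurrence)
            if gramInfo.get(norm, {}).get("pos") == pos]
```

So locations_with_pos("article") picks up both occurrences of "the" at locations 0 and 6 — the point being that a search can be narrowed by part of speech without re-running any analysis at query time.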
Now I'm thinking that one Work, with say 100,000 words including the trivial ones, can simply go into the kind of structure above. But of course it depends on file sizes and Python overhead, and more critically on the kind of grammatical output to be stored and searched on... Is a dict actually practical for this? I don't know; I haven't seen the grammatical info.
(End of annoying suggestions. At least for this morning, because it's 11:58AM.)
Correction: I typed "one file per Work" when I meant "one file per volume". At least I think that...but then, maybe it should be an HTML file with lots of anchors for easy positioning, rather than a text file; one way of expressing a search result is as a collection of links to anchors.
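If the per-volume file does become HTML with anchors, turning a search result into links is nearly a one-liner — a sketch, with the anchor naming scheme (w0, w1, ... by word location) entirely made up for illustration:

```python
def result_links(volume_file, locations):
    # The "w{loc}" anchor-per-word-location scheme is hypothetical; any
    # stable naming that the indexer and the HTML generator agree on works.
    return ["%s#w%d" % (volume_file, loc) for loc in locations]

# e.g. result_links("volume1.html", [6, 8])
# -> ["volume1.html#w6", "volume1.html#w8"]
```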
Correction: I have no idea why I typed "a dict and two lists" and then wrote out the example correctly (I think). Oops.
Looking for normal form "the" within 3 words of normal form "dog", we need something like
function findWithin(A, B, N) {
    // firstOccur maps a normal form to the location of its first occurrence;
    // occurrence[locA] == [A, A', nextA], where A' is the non-normalized
    // instance of A at that point (e.g. "The" rather than "the") and nextA
    // is the location of the next occurrence of the same normal form (0 = none).
    var locA = firstOccur[A], locB = firstOccur[B];
    if (null == locA || null == locB) return null; // one word never occurs: fail
    if (locA < locB) return findWithinOrd(locA, locB, N);
    else return findWithinOrd(locB, locA, N);
}

function findWithinOrd(locX, locY, N) {
    // precondition: locX < locY
    if (locY - locX <= N) return [locX, locY]; // close enough: done
    var nextX = occurrence[locX][2]; // advance whichever word is earlier
    if (nextX == 0) return null; // no later occurrence: fail
    if (nextX < locY) return findWithinOrd(nextX, locY, N);
    else return findWithinOrd(locY, nextX, N);
}
findWithin("the","dog",3)
    locA=0; locB=8; neither is null;
    findWithinOrd(0,8,3)
        (8-0 <= 3) is false;
        nextX = occurrence[0][2] = 6;
        (6 < 8), so findWithinOrd(6,8,3)
            (8-6 <= 3), so return [6,8]
        return [6,8]
    return [6,8]
so the result is the pair [6,8] and we can look at occurrence[6] and occurrence[8] to see which word is which. Something like that.
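On the Python side the same search reads naturally as a loop rather than recursion (which also avoids deep recursion when the two words are far apart in a long Work) — a sketch against the occurrence/firstOccur structures above:

```python
def find_within(A, B, N, occurrence, firstOccur):
    """Return [locX, locY] putting A and B within N words of each other, else None."""
    locA, locB = firstOccur.get(A), firstOccur.get(B)
    if locA is None or locB is None:
        return None  # one of the words never occurs
    lo, hi = (locA, locB) if locA < locB else (locB, locA)
    while hi - lo > N:
        nxt = occurrence[lo][2]  # next occurrence of whichever word is earlier
        if nxt == 0:             # 0 marks "no later occurrence": fail
            return None
        lo, hi = (nxt, hi) if nxt < hi else (hi, nxt)
    return [lo, hi]

# With the fox-sentence data from earlier:
occurrence = [["the", "The", 6], ["quick", None, 0], ["brown", None, 0], ["fox", None, 0],
              ["jump", "jumps", 0], ["over", None, 0], ["the", None, 0],
              ["lazy", None, 0], ["dog", None, 0]]
firstOccur = {"brown": 2, "dog": 8, "fox": 3, "jump": 4, "lazy": 7,
              "over": 5, "quick": 1, "the": 0}
# find_within("the", "dog", 3, occurrence, firstOccur) follows the same path
# as the trace above and returns [6, 8].
```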