Saturday, September 19, 2015
It occurs to me that part of the point of this project is that it should provide an easy way to point to specific evidence of what Lenin's thinking was really all about...yes? And if so, then we want to maximize the power of URLs as our fundamental pointing mechanism. So the question arises: what kinds of URLs should be supported?
Thursday, July 2, 2015
from XML to Lenin-only to conc
I may have missed some earlier discussions and suggestions, but let me ask a question about data structures. Assuming that we start with an XML format (fb2), we have the following stages of the process:
1. For each volume, extract the "Lenin-only" content of the volume.
2. Merge the contents of all volumes.
3. Produce a conc from the merged file.
My question is: what data structures do we use in the Lenin-only content? JSON records that include volume, title of work, page number? Perhaps paragraph breaks? Perhaps number all paragraphs consecutively?
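To make the question concrete, here is one possible shape for a single "Lenin-only" record — every field name and value below is an invented suggestion, not a decision:

```python
import json

# One candidate record shape for "Lenin-only" content.
# All field names and values are invented placeholders.
record = {
    "volume": 3,                  # volume number in the Collected Works
    "work": "Название работы",    # title of the work (placeholder)
    "page": 11,                   # page number in the printed volume
    "paragraph": 57,              # numbered consecutively through the volume
    "text": "Текст абзаца...",    # the paragraph's text (placeholder)
}

# Round-trips cleanly through standard Python JSON.
print(json.dumps(record, ensure_ascii=False))
```

Whether paragraphs are numbered per work or per volume is exactly the open question; the record above just shows that either choice fits a flat JSON list.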
Sunday, June 28, 2015
Lenin in FB2 format
fb2 (FictionBook) is an XML format that is described here
https://en.wikipedia.org/wiki/FictionBook.
The entire Lenin in fb2 is ON OUR SHARED GOOGLE DRIVE
and also at this link (will start downloading a zip archive):
http://vk.com/doc59246014_279951847?hash=335f67b31f178bb636&dl=4b35b954ea662c1f05
It's much easier to work with than MS Word, even without parsing it. For instance, the headers of all even-numbered pages can be found by searching for:
</p><empty-line /><p>В. И. ЛЕНИН</p>
as in
</p><empty-line /><p>10</p><empty-line /><p>В. И. ЛЕНИН</p>
Odd-numbered pages contain the title of the work, in capitals, as in:
</p><empty-line /><p>11</p><empty-line /><p>НОВЫЕ ХОЗЯЙСТВЕННЫЕ ДВИЖЕНИЯ В КРЕСТЬЯНСКОЙ ЖИЗНИ</p>
A simple regular-expression search can find all page numbers; schematically:
</p><empty-line /><p>NUMERALS-ONLY</p><empty-line />
And then for odd numbers the next <p> element contains the title.
May be useful.
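The page-header searches described above can be sketched as a small regular-expression pass. The sample string and the exact pattern here are illustrative assumptions built from the examples above, not something tested against the real files:

```python
import re

# Illustrative fragment of fb2 text, built from the examples above.
fb2_text = ('</p><empty-line /><p>10</p><empty-line /><p>В. И. ЛЕНИН</p>'
            '</p><empty-line /><p>11</p><empty-line /><p>НОВЫЕ ХОЗЯЙСТВЕННЫЕ '
            'ДВИЖЕНИЯ В КРЕСТЬЯНСКОЙ ЖИЗНИ</p>')

# A page header is a numerals-only <p>, followed by the running head:
# "В. И. ЛЕНИН" on even pages, the work's title in capitals on odd pages.
page_header = re.compile(
    r'</p><empty-line /><p>(\d+)</p><empty-line /><p>([^<]+)</p>')

for number, head in page_header.findall(fb2_text):
    page = int(number)
    if page % 2 == 1:   # odd page: running head is the work's title
        print(page, 'title:', head)
    else:               # even page: running head is the author line
        print(page, 'header:', head)
```

In practice the pattern would need checking against real volumes (whitespace, line breaks inside the markup, numerals in body text), but the NUMERALS-ONLY search above maps directly onto the `(\d+)` group.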
Thursday, June 25, 2015
Output Structures
While John works on the generation of a working-data-structure set, let's go back to the description of the output. We're thinking about KWIC and concordance structures, but I take it that the actual purpose is to be able to answer questions like those previously proposed in LeninConc: Searching the concordance. I haven't seen the Lenin files; I'd like to assume that they can be mechanically broken into
Volume--Work--Chapter--Section--Paragraph--Line--Word
or some similar set of nested structures. (Note that I'm choosing "Paragraph" rather than "Page" and yet "Line" rather than "Sentence"...this is debatable, but it does have the advantage that they correspond naturally to div structures in the HTML.) I would propose that we do this breakdown and then, except for Word at the bottom, we generate a copy of each volume as a single HTML file with a div for each subdivision, with id values representing location in the structure: volume 7's work 2, chapter 1, section 8, paragraph 3, line 6 might be:
<div id="L7_2_1_8_3_6">for all good men to come to the aid of </div>
and from this you know that the enclosing paragraph has id L7_2_1_8_3 and that the entire chapter can be found within the div whose id is L7_2_1. Yes, the initial "L7_" is redundant in that every id in the volume file will begin the same way, but this gives us working sets of ids for more than one volume, which I think is likely to be necessary. At any rate,
http://servername.com/leninconc/volname#L7_2_1_8_3_6
will simply scroll to the right place, with no further work to be done. A query result can now be an id, sometimes a pair of ids indicating beginning and end of a range, or possibly sometimes an id with a numeric offset to indicate that the range begins or ends with a particular character count.
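The id scheme above can be taken apart mechanically; this is a sketch, and the function names and level names are mine, not part of any agreed design:

```python
# Sketch of the proposed id scheme. An id like "L7_2_1_8_3_6" encodes
# volume 7, work 2, chapter 1, section 8, paragraph 3, line 6;
# truncating it yields the ids of the enclosing units.

LEVELS = ["volume", "work", "chapter", "section", "paragraph", "line"]

def parse_id(div_id):
    """Map an id such as 'L7_2_1_8_3_6' to its named levels."""
    parts = div_id.lstrip("L").split("_")
    return dict(zip(LEVELS, (int(p) for p in parts)))

def enclosing(div_id, level):
    """Return the id of the enclosing unit at the given level."""
    depth = LEVELS.index(level) + 1
    head = div_id.lstrip("L").split("_")[:depth]
    return "L" + "_".join(head)

print(parse_id("L7_2_1_8_3_6"))
print(enclosing("L7_2_1_8_3_6", "paragraph"))   # L7_2_1_8_3
print(enclosing("L7_2_1_8_3_6", "chapter"))     # L7_2_1
```

The nice property, as noted above, is that "enclosing unit" is pure string truncation; no lookup table is needed to go from a line to its paragraph or chapter.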
The actual interface, then, might be a web page with John's JSON files loaded to represent the sub-corpus active at any given moment, and with query-result display either
- in an iframe holding one volume's HTML file, with scroll and selection set to display the right stuff, or
- in a constructed div holding exactly the text(s) requested, copied from one or more volumes in iframes which are, for the moment, invisible.
Sunday, May 17, 2015
Data Structures?
I think my main role here is Maker Of Annoying Suggestions, at least for the moment, and for the moment I'm going to suggest that the Works of Lenin (which you were saying, if I recall correctly which usually I don't, would be 50-some plain-text roughly-a-megabyte volumes, each with a few Works, or was it 50-some Works?) can go in a folder, one file per volume1, along with a set of JSON files, one per Work, which are loaded at need by the interactive part of the project.
(update: I'm assuming standard Python JSON; there are SAX-like streaming libraries for dealing with multi-gigabyte files, but looking at How Big is TOO BIG for JSON? suggests to me that we probably needn't worry about this.)
The top-level JSON file, top.json, contains one list of dicts, each with metadata about one Work, including which files store its info. That lets you focus on one or two or ten Works at a time, based on date ranges or explicit inclusion or previous searches or whatever. Think of it as roughly
[{name: "Work1", source_file: "volume1.txt", dataFile: "Work1.json", start: "1896", end: "1898"}, {name: "Work2", source_file: "volume1.txt", dataFile: "Work2.json", ... }, .... ]
Work1.json, like each of its siblings, will contain a list of lists, a dict of ints and a dict of dicts2. Let me go back to imagining "The quick brown fox jumps over the lazy dog." as Work1. Okay, we have
{ occurrence: [["the", "The", 6], ["quick", null, 0], ["brown", null, 0], ["fox", null, 0], ["jump", "jumps", 0], ....], firstOccur: {"brown": 2, "dog": 8, "fox": 3, "jump": 4, "lazy": 7, "over": 5, "quick": 1, "the": 0}, gramInfo: {"the": {pos: "article", ...}, "jump": {pos: "verb", ...} ... } }
The occurrence table has one row for every occurrence of a word in the Work, even a trivial word. It has at least three columns: one for the normal form of the word, which we use for lookup and search; the next for the actual appearance of the word in this particular occurrence; and the third for the location of the next occurrence of that word's normal form, if any. "the" appears at location 0 in the form "The", and at location 6 in its normal form; "jump" appears at location 4 in the form "jumps"; everything else appears in its normal form, and only once. Now if we look for the word "the" within three words of the word "dog", we can do it in a reasonable time. Here I'm no longer dropping out the trivial words (I am dropping out punctuation, which I suppose should get a fourth column in the occurrence table.)
It's possible that this should all go in a database; I admit I was thinking that... but it's not big enough to make the overhead worthwhile, and grammatical info doesn't go well into database structures, because each row ends up having different properties for which you need to record values. In fact, until I thought about the "words within N words of each other" problem just this morning, I thought of this as mainly a KWIC index: basically a list of 5-tuples, one tuple for each non-trivial word. So if there are 200,000 words in a volume, and 50,000 are occurrences of maybe 20K distinct non-trivial words, that's a 50K * 5 list, which could be 50K database records or just an array loaded into memory; it's small. That's made somewhat more complex by the preservation of context (trivial words + punctuation marks), and then made considerably more complex by the grammatical info attached to each word, which could be stored separately as 20K rather poorly-structured records, maybe a dict of 20K small dicts... The grammatical info is used in search, so it has to be pre-computed and stored.
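The "words within N words" lookup over the occurrence table can be sketched like this. It is a toy reconstruction of the fox-and-dog example, and it assumes (as the sample rows suggest) that a third column of 0 means "no further occurrence":

```python
# Toy occurrence table and first-occurrence index for
# "The quick brown fox jumps over the lazy dog."
occurrence = [
    ["the", "The", 6], ["quick", None, 0], ["brown", None, 0],
    ["fox", None, 0], ["jump", "jumps", 0], ["over", None, 0],
    ["the", None, 0], ["lazy", None, 0], ["dog", None, 0],
]
firstOccur = {"brown": 2, "dog": 8, "fox": 3, "jump": 4,
              "lazy": 7, "over": 5, "quick": 1, "the": 0}

def positions(word):
    """Walk the next-occurrence links, starting from firstOccur."""
    pos = firstOccur.get(word)
    while pos is not None:
        yield pos
        nxt = occurrence[pos][2]
        pos = nxt if nxt else None   # 0 is taken to mean "no more"

def within_n(word_a, word_b, n):
    """True if some occurrence of word_a is within n words of word_b."""
    return any(abs(a - b) <= n
               for a in positions(word_a)
               for b in positions(word_b))

print(within_n("the", "dog", 3))   # "the" at 6, "dog" at 8
```

One wrinkle this exposes: using 0 as the "no more occurrences" sentinel collides with 0 being a legitimate position, so a real implementation would want null (or -1) instead.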
Now I'm thinking that one work, with say 100,000 words including the trivial, can simply go into the kind of structure above. But of course it depends on file sizes and Python overhead, and it depends more critically on the kind of grammatical output to be stored and searched-on...is a dict actually practical for this? I don't know, I haven't seen the grammatical info.
(End of annoying suggestions. At least for this morning, because it's 11:58AM.)
Correction: I typed "one file per Work" when I meant "one file per volume". At least I think that...but then, maybe it should be an HTML file with lots of anchors for easy positioning, rather than a text file; one way of expressing a search result is as a collection of links to anchors.
Correction: I have no idea why I typed "a dict and two lists" and then wrote out the example correctly (I think). Oops.
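Going back to top.json for a moment: the sub-corpus selection it enables could be sketched as below. The select helper, its overlap rule, and the in-memory list are my assumptions; only the record fields come from the example above:

```python
# In-memory stand-in for top.json, with the fields from the example above.
top = [
    {"name": "Work1", "source_file": "volume1.txt",
     "dataFile": "Work1.json", "start": "1896", "end": "1898"},
    {"name": "Work2", "source_file": "volume1.txt",
     "dataFile": "Work2.json", "start": "1899", "end": "1899"},
]

def select(works, start, end):
    """Names of Works whose date range overlaps [start, end].

    Four-digit year strings compare correctly as strings.
    """
    return [w["name"] for w in works
            if w["start"] <= end and w["end"] >= start]

print(select(top, "1897", "1898"))
```

Date-range selection is then just a filter over the metadata list; the per-Work JSON files are only loaded for the Works that survive the filter.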
Thursday, May 14, 2015
Searching the concordance
A concordance can be considered a restricted kind of corpus, and its query capabilities can be modeled on those of the Russian National Corpus (RNC) / Национальный корпус русского языка, http://www.ruscorpora.ru/en/index.html. Some of those capabilities are:
1. Define a sub-corpus by start-end dates, or by the title of Lenin's work, or by the volume.
2. Search by a "dictionary word" (e.g., дом) or by a specific word form (домов).
3. Search by a set of grammatical features (e.g., a singular noun in the genitive case).
4. Search by two words separated by at most N words, both possibly constrained by grammatical features.
To be continued.
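Query type 3 could be sketched against the kind of per-lemma grammatical records proposed elsewhere on this blog. The gramInfo contents and the matching rule here are invented for illustration:

```python
# Invented per-lemma grammatical records (a tiny stand-in for gramInfo).
gramInfo = {
    "дом":  {"pos": "noun", "gender": "masc"},
    "идти": {"pos": "verb", "aspect": "imperf"},
}

def match(features):
    """Lemmas whose grammatical record contains all requested features."""
    return [lemma for lemma, info in gramInfo.items()
            if all(info.get(k) == v for k, v in features.items())]

print(match({"pos": "noun"}))
```

Query types 2 and 4 then compose naturally: look up lemmas (or word forms) first, filter by features like this, and apply the within-N-words test to the survivors.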
Monday, May 11, 2015
Volume structure
Three Russian words:
ЛЕНИН Lenin
ПРЕДИСЛОВИЕ Introduction
ПРИМЕЧАНИЯ Notes
We have complete works in 55 volumes. Each volume has this structure. (There is a sample, vol03.doc in LeninAndFriends.)
BOILERPLATE, beginning with "Proletarians of the world, unite!"; a portrait; etc.
ПРЕДИСЛОВИЕ, to be ignored, numbered in Roman numerals. A header at the top says "XII ПРЕДИСЛОВИЕ" on even pages or "ПРЕДИСЛОВИЕ XI" on odd pages.
The Introduction always concludes with this line:
Институт марксизма-ленинизма при ЦК КПСС
Institute of Marxism-Leninism at the CC CPSU (Central Committee of the Communist Party of the Soviet Union)
Then there is some variation of
В. И. ЛЕНИН
1897
which is a caption under the portrait of Lenin used in the volume. The dates will monotonically increase.
Between the portrait and ПРИМЕЧАНИЯ are one or more WORKS (usually more than one). Each work starts with a preamble that gives the title, the source (manuscript or previous publication), and dates. Then the text begins. Pages are numbered. Each page has a header: <number><title> on even-numbered pages or <title><number> on odd-numbered ones. Occasionally there are tables that should be recognized and skipped; I'm not sure how. And that is all.
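The landmarks above suggest a mechanical trim of each volume down to the region containing its Works. This sketch uses an invented miniature volume and assumes the landmark lines appear exactly as described:

```python
# Invented miniature volume exercising the landmarks described above.
volume = """BOILERPLATE ...
ПРЕДИСЛОВИЕ ...
Институт марксизма-ленинизма при ЦК КПСС
В. И. ЛЕНИН
1897
FIRST WORK TEXT ...
ПРИМЕЧАНИЯ ..."""

lines = volume.splitlines()
# The introduction always ends with the Institute line;
# the notes section starts at ПРИМЕЧАНИЯ.
start = lines.index("Институт марксизма-ленинизма при ЦК КПСС") + 1
end = next(i for i, l in enumerate(lines) if l.startswith("ПРИМЕЧАНИЯ"))
works_region = lines[start:end]   # still includes the portrait caption
print(works_region)
```

A real pass would also drop the two portrait-caption lines ("В. И. ЛЕНИН" plus the date) and the page headers, and would need a plan for the tables; but the start and end landmarks themselves look mechanically detectable.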