Tuesday, May 31, 2016

Deadlines summer 2016

For ELAN and FLEx to MannX (EAF + LIFT to html5):
  abstract deadline June 5
  demo deadline (including servlets and User Guide) July 10.

For LeninConc demo: August 20.

Saturday, March 26, 2016

TOC and lemma+literal matching of words and phrases

Looking at the table of contents spreadsheet, which I saved as Vol_Title_StartPage_EndPage.csv, I find that there are thirty end-of-lines within the titles, i.e. lines with unbalanced quote-marks. So I open the file in vi, search for /^[^"]*"[^"]*$/, and join each such line with the next until the problem goes away. Okay, now I have the TOC.js file, which looks like this:
var TOC=[["номер тома", "название работы", "номер начальной страницы", "номер конечной страницы"],
["1", "НОВЫЕ ХОЗЯЙСТВЕННЫЕ ДВИЖЕНИЯ В КРЕСТЬЯНСКОЙ ЖИЗНИ. По поводу книги В. Е. Постникова — «Южно-русское крестьянское хозяйство»", "1", "66"],
["1", "ПО ПОВОДУ ТАК НАЗЫВАЕМОГО ВОПРОСА О РЫНКАХ", "67", "122"],
["1", "ЧТО ТАКОЕ «ДРУЗЬЯ НАРОДА» И КАК ОНИ ВОЮЮТ ПРОТИВ СОЦИАЛ-ДЕМОКРАТОВ? (Ответ на статьи «Русского Богатства» против марксистов)", "125", "346"],
["1", "ЭКОНОМИЧЕСКОЕ СОДЕРЖАНИЕ НАРОДНИЧЕСТВА И КРИТИКА ЕГО В КНИГЕ Г. СТРУВЕ (ОТРАЖЕНИЕ МАРКСИЗМА В БУРЖУАЗНОЙ ЛИТЕРАТУРЕ). По поводу книги П. Струве: «Критические заметки к вопросу об экономическом развитии России». ", "347", "534"],
["1", "ПОМЕТКИ, ВЫЧИСЛЕНИЯ И ПОДЧЕРКИВАНИЯ В. И. ЛЕНИНА В КНИГЕ В. Е. ПОСТНИКОВА «ЮЖНОРУССКОЕ КРЕСТЬЯНСКОЕ ХОЗЯЙСТВО»", "537", "546"],
["1", "ПРОШЕНИЯ В. И. УЛЬЯНОВА (ЛЕНИНА) ", "549", "562"],
["1", "Список работ В. И. Ленина, относящихся к 1891—1894 гг., до настоящего времени не разысканных ", "565", "566"],
["1", "Список работ, переведенных В. И. Лениным", "567", "567"],
["2", "ФРИДРИХ ЭНГЕЛЬС", "1", "14"],
["2", "ОБЪЯСНЕНИЕ ЗАКОНА О ШТРАФАХ, ВЗИМАЕМЫХ С РАБОЧИХ НА ФАБРИКАХ И ЗАВОДАХ", "15", "60"],
["2", "ГИМНАЗИЧЕСКИЕ ХОЗЯЙСТВА И ИСПРАВИТЕЛЬНЫЕ ГИМНАЗИИ («Русское Богатство») ", "61", "69"],
["2", "К РАБОЧИМ И РАБОТНИЦАМ ФАБРИКИ ТОРНТОНА", "70", "74"],
["2", "О ЧЕМ ДУМАЮТ НАШИ МИНИСТРЫ?", "75", "80"],
["2", "ПРОЕКТ И ОБЪЯСНЕНИЕ ПРОГРАММЫ СОЦИАЛ-ДЕМОКРАТИЧЕСКОЙ ПАРТИИ", "81", "110"],
["2", "ЦАРСКОМУ ПРАВИТЕЛЬСТВУ", "111", "116"],
...[there are 6382 lines total, at the moment, ending with]...
["54", "Л. Б. КАМЕНЕВУ. 7 ноября", "6", "6"],
["54", "В ЦЕНТРОПЕЧАТЬ, ИЗДАТЕЛЬСКИЕ ОТДЕЛЫ ВСНХ, НКЗЕМ, НКПС, НКПРОД 8 ноября", "7", "7"],
["54", "В. А. СМОЛЬЯНИНОВУ. 9 ноября", "7", "7"],
["54", "В. А. АВАНЕСОВУ. 9 ноября", "7", "8"],
["54", "В. А. СМОЛЬЯНИНОВУ. 9 ноября", "8", "8"],
["54", "НС. УНШЛИХТУ. 9 ноября", "8", "9"],
["54", "Д. Б. РЯЗАНОВУ. 9 ноября", "9", "9"],
["54", "В. М. МИХАЙЛОВУ ДЛЯ ЧЛЕНОВ ПОЛИТБЮРО ЦК РКП(б). 9 ноября", "9", "9"],
["54", "Г. М. КРЖИЖАНОВСКОМУ. 9 ноября", "10", "10"],
["54", "А. Б. ХАЛАТОВУ. 10 ноября", "10", "10"],
["54", "Н. П. ГОРБУНОВУ. 10 ноября", "10", "11"]];
The problem there is that when there's more than one document per page, this doesn't give me much help. How do I tell which title goes with which paragraph? Is there a reliable regular expression? When I look at that last case, page 10 of volume 54, I see
<p>10</p><empty-line /><p>ЗАПИСКА П. П. ГОРБУНОВУ</p><empty-line /><p>И ТЕЛЕГРАММА Л. Б. КРАСИНУ</p>

<p>т. Горбунов! Прошу отправить шифром Красину.</p>

<p>Ваша депеша от 1. XI почти истерична. Вы забыли, что уступить сразу Лесли Уркарту даже вы не предлагали, а решение Политбюро очень обдуманное и не есть</p><empty-line /><p>* Имеется в виду 7 и 8 ноября. Ред.</p><empty-line /><p>6</p><empty-line /><p>В. И. ЛЕНИН</p>

<p>отказ. Относительно же «Фаундэйшн Компани» вам 29. X послано полное согласие и поручение спешить. Надо позаботиться о более быстром обмене телеграмм между нами: аппарат Внешторга вообще плоховат.</p>

<p>Ленин</p>

<p>
Написано 7 ноября 1921 г.</p><empty-line /><p>Послано в Лондон</p>

<p>Впервые напечатано в 1959 г.</p><empty-line /><p>в Ленинском сборнике XXXVI

Печатается по рукописи</p>

<p>11</p><empty-line />...
meanwhile, line 24301 of the file says
<p>20. Г. М. КРЖИЖАНОВСКОМУ. 9 ноября 10</p><empty-line />
<p>21. А. Б. ХАЛАТОВУ. 10 ноября 10</p><empty-line />
<p>22. Н. П. ГОРБУНОВУ. 10 ноября 10-11</p><empty-line />
<p>23. И. И. РАДЧЕНКО. 10 ноября 11</p>
and I suppose that's the relevant part? But I don't know how to work with this.
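For what it's worth, those per-volume TOC lines look machine-readable: an ordinal, a period, the title (including its date), and a trailing page or page range. A sketch of a parser, assuming that pattern holds (I haven't checked it against all the volumes):

```javascript
// Tentative parse of a per-volume TOC line such as
// "<p>22. Н. П. ГОРБУНОВУ. 10 ноября 10-11</p>":
// leading ordinal, then the title, then a page or page range at the end.
function parseTocLine(line) {
  const m = line.match(/^<p>(\d+)\.\s+(.*?)\s+(\d+)(?:-(\d+))?<\/p>/);
  if (!m) return null;
  return {
    ordinal: Number(m[1]),
    title: m[2],
    startPage: Number(m[3]),
    endPage: Number(m[4] || m[3])   // single page if no range given
  };
}
```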

Also, I'm still working on "lemmaMode", which, as I was explaining, is trickier than I had originally realized. The basic queries are findPhrase (covering a single word as well as a phrase, and finding either one occurrence or all occurrences) and findWithin(A,B,N), for finding places where A and B occur within N words of each other. I suppose A and B could be phrases rather than words, but I've only written code for them being words, in fact lemmas. Let's say we have
  • matchLiteralToLoc(W,loc) returns TRUE iff wordList[loc]==W
  • matchLiteralListToLoc([W0,W1,...Wn],loc) returns TRUE iff for each 0<=i<=n, matchLiteralToLoc(Wi,loc+i) is TRUE;
  • getLemma(W) returns the lemma of W, or W itself if not found;
  • matchLemmaToLoc(lemma,loc) returns TRUE iff getLemma(wordList[loc])==lemma
  • findFirstLocOfLemma(L) returns the smallest N: matchLemmaToLoc(L,N), or -1 on failure
  • findFirstLocOfLiteral(W) returns the smallest N: matchLiteralToLoc(W,N), or -1 on failure
  • findFirstLocOfLiteralList([W0,...Wk]) returns the smallest N: matchLiteralListToLoc([W0..Wk],N), or -1 on failure
  • matchLemmaListToLoc([L0,L1,L2,..LN],loc) similarly returns TRUE iff 0<=i<=N => getLemma(wordList[loc+i])==Li
  • findFirstLocOfLemmaList([L0,...Lk]) returns the smallest N: matchLemmaListToLoc([L0,...Lk],N), or -1 on failure
  • findNextLocOfLemma(L,loc) fails if L!=getLemma(wordList[loc]); otherwise finds the smallest N > loc : matchLemmaToLoc(L,N) is TRUE, or -1 on failure
  • findNextLocOfLemmaList([L0,...Lk],loc) finds smallest N > loc : matchLemmaListToLoc([L0,..Lk],N) or -1 on failure
Um....thinking. :-)
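As a sanity check on these definitions, here's a toy sketch of a few of the primitives in JavaScript, including one way findWithin could work. The shapes of wordList and gramInfo are my assumptions about the real data files:

```javascript
// Toy data: wordList is the corpus in order; gramInfo maps word form -> lemma.
const wordList = ["The", "quick", "brown", "fox", "jumps", "over"];
const gramInfo = { jumps: "JUMP", over: "OVER" };

// The stored lemma, or the word itself if it isn't in gramInfo.
function getLemma(w) {
  return gramInfo[w] !== undefined ? gramInfo[w] : w;
}

function matchLemmaToLoc(lemma, loc) {
  return getLemma(wordList[loc]) === lemma;
}

function matchLemmaListToLoc(lemmas, loc) {
  return lemmas.every((L, i) => matchLemmaToLoc(L, loc + i));
}

function findFirstLocOfLemmaList(lemmas) {
  for (let n = 0; n + lemmas.length <= wordList.length; n++) {
    if (matchLemmaListToLoc(lemmas, n)) return n;
  }
  return -1;
}

// One way findWithin(A,B,N) could work: the first pair of locations
// where lemma A and lemma B occur within N words of each other.
function findWithin(lemmaA, lemmaB, N) {
  for (let i = 0; i < wordList.length; i++) {
    if (!matchLemmaToLoc(lemmaA, i)) continue;
    const lo = Math.max(0, i - N);
    const hi = Math.min(wordList.length - 1, i + N);
    for (let j = lo; j <= hi; j++) {
      if (j !== i && matchLemmaToLoc(lemmaB, j)) return [i, j];
    }
  }
  return null;  // no such pair
}
```

With this toy data, findFirstLocOfLemmaList(["JUMP", "OVER"]) returns 4, while findFirstLocOfLemmaList(["JUMPED"]) returns -1, since "jumped" never occurs in the sample and so has no precomputed lemma.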

Saturday, February 27, 2016

More on Data Structures

  While going over code today, I realized that I can shrink some data files quite a bit by removing redundancy. When I defined occurrence[i] as
[theLemma,theWordAsItAppearsHere,theNextLocationOfThatLemma]
I was forgetting that the lemma for a given word is already stored in gramInfo, so I can simply leave out the first column of that table. Yay! In fact, I am editing so that the wordlist is one data structure, i.e. just a list of words as they occur, and the corresponding nextLocation list is another, just a list of numbers. That should make them more compressible, which might help performance if we store these as zip files.
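To make the split concrete, here's a toy sketch of building the nextLocation list from a wordlist, given a lemma map like gramInfo. The names and data shapes are mine; the real files are generated offline:

```javascript
// Toy sketch of the two parallel structures: wordList holds the words in
// corpus order; nextLoc[i] is the next location of the same lemma, or -1.
const wordList = ["дом", "и", "дома", "и", "дом"];
const lemmaOf = { "дом": "ДОМ", "дома": "ДОМ", "и": "И" };

function buildNextLoc(words) {
  const next = new Array(words.length).fill(-1);
  const lastSeen = {};            // lemma -> most recent location
  for (let i = 0; i < words.length; i++) {
    const lem = lemmaOf[words[i]] || words[i];
    if (lastSeen[lem] !== undefined) next[lastSeen[lem]] = i;
    lastSeen[lem] = i;
  }
  return next;
}
```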

  I also realized there's a small problem, or maybe a big one, but anyway one I can't do much about: search-by-lemma is limited to the lemmas of words which actually occur in the data, because any other word won't be in gramInfo. If we look for "jumped over" in the sample "The,quick,brown,fox,jumps,over...", getLemma will just say
getLemma("jumped")="JUMPED"
There's no way that it can tell that the answer should be JUMP, because it isn't running the lemmatizer; it's only looking up precomputed answers in gramInfo. So it won't find anything, although it should find that "jumps, over" is a match at location 4. To fix that, we would have to install the lemmatizer as a web service and invoke it via HTTP for each query.
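If we ever did go the web-service route, the query-side lookup might look something like this. Everything here is hypothetical: the /lemmatize endpoint, its JSON response shape, and the function name are all invented for illustration; no such service exists yet.

```javascript
// Hypothetical fallback: if a query word has no precomputed lemma in gramInfo,
// ask a remote lemmatizer over HTTP. Endpoint and response shape are invented.
async function getLemmaRemote(word, gramInfo) {
  if (gramInfo[word] !== undefined) return gramInfo[word];  // precomputed case
  const resp = await fetch("/lemmatize?word=" + encodeURIComponent(word));
  if (!resp.ok) return word;            // on failure, fall back to the word itself
  const data = await resp.json();       // assumed shape: { lemma: "..." }
  return data.lemma || word;
}
```

The obvious cost is one HTTP round trip per unknown query word, which is why precomputing stays attractive.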

Saturday, September 19, 2015

RESTian Query Interface?

It occurs to me that part of the point of this project is that it should provide an easy way to point to specific evidence of what Lenin's thinking was really all about...yes? And if so, then we want to maximize the power of URLs as our fundamental pointing mechanism. So the question arises: what kinds of URLs should be supported?

Thursday, July 2, 2015

from XML to Lenin-only to conc

I may have missed some earlier discussions and suggestions, but let me ask a question about data structures. Assuming that we start with an XML format (fb2), we have the following stages of the process:

1. For each volume, extract the "Lenin-only" content of the volume.
2. Merge the contents of all volumes.
3. Produce a conc from the merged file.

My question is: what data structures do we use in the Lenin-only content? JSON records that include volume, title of work, and page number? Perhaps paragraph breaks? Perhaps number all paragraphs consecutively?
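To make the question concrete, here is one possible record shape, purely as a strawman; every field name is my guess, and the text is elided:

```javascript
// Strawman: one record per paragraph of Lenin-only content.
const sampleParagraph = {
  volume: 2,                   // from the TOC
  work: "ФРИДРИХ ЭНГЕЛЬС",     // title of the work, as in the TOC
  page: 3,                     // page on which this paragraph starts
  para: 17,                    // paragraph number, consecutive within the volume
  text: "..."                  // the paragraph itself
};
```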

Sunday, June 28, 2015

Lenin in FB2 format

fb2 (FictionBook) is an XML format that is described here

https://en.wikipedia.org/wiki/FictionBook

The entire Lenin in fb2 is ON OUR SHARED GOOGLE DRIVE
and also at this link (this will start downloading a zip archive):

http://vk.com/doc59246014_279951847?hash=335f67b31f178bb636&dl=4b35b954ea662c1f05

It's much easier to work with than MSWord, even without parsing it. For instance, the headers of all even-numbered pages can be found by searching for:

</p><empty-line /><p>В. И. ЛЕНИН</p>
as in
</p><empty-line /><p>10</p><empty-line /><p>В. И. ЛЕНИН</p>

Odd-numbered pages contain the title of the work, in capitals, as in:
</p><empty-line /><p>11</p><empty-line /><p>НОВЫЕ ХОЗЯЙСТВЕННЫЕ ДВИЖЕНИЯ В КРЕСТЬЯНСКОЙ ЖИЗНИ</p>

A simple regular-expression search can find all page numbers, descriptively:

</p><empty-line /><p>NUMERALS-ONLY</p><empty-line />

And then, for odd page numbers, the next <p> element contains the title.
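Spelled out as an actual regular expression, the "NUMERALS-ONLY" search might look like this. A sketch: I'm assuming the markup is this uniform across volumes, which may not hold.

```javascript
// Find markers of the form </p><empty-line /><p>NN</p><empty-line /><p>...</p>.
// For odd pages the following <p> holds the work's title; for even pages,
// the running header "В. И. ЛЕНИН".
function findPages(fb2Text) {
  const marker = /<\/p><empty-line \/><p>(\d+)<\/p><empty-line \/><p>([^<]*)<\/p>/g;
  const pages = [];
  let m;
  while ((m = marker.exec(fb2Text)) !== null) {
    const num = Number(m[1]);
    pages.push(num % 2 === 1
      ? { page: num, title: m[2] }     // odd page: title of the work
      : { page: num, header: m[2] });  // even page: running header
  }
  return pages;
}
```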

May be useful.

Thursday, June 25, 2015

Output Structures

While John works on the generation of a working-data-structure set, let's go back to the description of the output. We're thinking about KWIC and Concordance structures, but I take it that the actual purpose is to be able to answer questions like those previously proposed in LeninConc: Searching the concordance
1. Define a sub-corpus by start-end dates, or by the title of Lenin's work, or by the volume.
2. Search by a "dictionary word" (e.g., дом) or by a specific word form (домов).
3. Search by a set of grammatical features (e.g., a singular noun in the genitive case).
4. Search by two words separated by at most N words, both possibly constrained by grammatical features.

To be continued.
I haven't seen the Lenin files; I'd like to assume that they can be mechanically broken into
    Volume--Work--Chapter--Section--Paragraph--Line--Word
or some similar set of nested structures. (Note that I'm choosing "Paragraph" rather than "Page", and "Line" rather than "Sentence"; this is debatable, but it has the advantage that these units correspond naturally to div structures in the HTML.) I would propose that we do this breakdown and then, except for Word at the bottom, generate a copy of each volume as a single HTML file with a div for each subdivision, with id values representing location in the structure: volume 7's work 2, chapter 1, section 8, paragraph 3, line 6 might be:
 <div id="L7_2_1_8_3_6">for all good men to come to the aid of </div>
and from this you know that the enclosing paragraph has id L7_2_1_8_3 and that the entire chapter can be found within the div whose id is L7_2_1. Yes, the initial "L7_" is redundant in that every id in the volume file will begin the same way, but this gives us working sets of ids for more than one volume, which I think is likely to be necessary. At any rate,
     http://servername.com/leninconc/volname#L7_2_1_8_3_6
will simply scroll to the right place, with no further work to be done. A query result can now be an id, sometimes a pair of ids indicating the beginning and end of a range, or possibly sometimes an id with a numeric offset to indicate that the range begins or ends at a particular character count.
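The id arithmetic is then trivial; a sketch, assuming exactly the L&lt;vol&gt;_&lt;work&gt;_&lt;chapter&gt;_&lt;section&gt;_&lt;paragraph&gt;_&lt;line&gt; scheme proposed above:

```javascript
// Given a line-level id like "L7_2_1_8_3_6", every enclosing div's id is a
// prefix: drop trailing components to climb the hierarchy, stopping at the
// work level (L<vol>_<work>).
function enclosingIds(id) {
  const parts = id.split("_");
  const out = [];
  for (let n = parts.length - 1; n >= 2; n--) {
    out.push(parts.slice(0, n).join("_"));
  }
  return out;
}
```

So enclosingIds("L7_2_1_8_3_6") yields the paragraph, section, chapter, and work ids in that order.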

The actual interface, then, might be a web page with John's JSON files loaded to represent the sub-corpus active at any given moment, and with query-result display either
  • in an iframe holding one volume's HTML file, with scroll and selection set to display the right stuff, or
  • in a constructed div holding exactly the text(s) requested, copied from one or more volumes in iframes which are, for the moment, invisible.
Does this make any sense? I am not a linguist, nor do I play one on the internet....