LeninConc: 2016

Looking at the table-of-contents spreadsheet, which I saved as Vol_Title_StartPage_EndPage.csv, I find thirty line breaks inside titles, which show up as lines with unbalanced quote marks. So I open the file in vi, search for /^[^"]*"[^"]*$/ (a line containing exactly one quote mark), and join each such line with the next until the problem goes away. Okay, now I have the TOC.js file, which looks like this:

var TOC=[["номер тома", "название работы", "номер начальной страницы", "номер конечной страницы"],
["1", "НОВЫЕ ХОЗЯЙСТВЕННЫЕ ДВИЖЕНИЯ В КРЕСТЬЯНСКОЙ ЖИЗНИ. По поводу книги В. Е. Постникова — «Южно-русское крестьянское хозяйство»", "1", "66"],
["1", "ПО ПОВОДУ ТАК НАЗЫВАЕМОГО ВОПРОСА О РЫНКАХ", "67", "122"],
["1", "ЧТО ТАКОЕ «ДРУЗЬЯ НАРОДА» И КАК ОНИ ВОЮЮТ ПРОТИВ СОЦИАЛ-ДЕМОКРАТОВ? (Ответ на статьи «Русского Богатства» против марксистов)", "125", "346"],
["1", "ЭКОНОМИЧЕСКОЕ СОДЕРЖАНИЕ НАРОДНИЧЕСТВА И КРИТИКА ЕГО В КНИГЕ Г. СТРУВЕ (ОТРАЖЕНИЕ МАРКСИЗМА В БУРЖУАЗНОЙ ЛИТЕРАТУРЕ). По поводу книги П. Струве: «Критические заметки к вопросу об экономическом развитии России». ", "347", "534"],
["1", "ПОМЕТКИ, ВЫЧИСЛЕНИЯ И ПОДЧЕРКИВАНИЯ В. И. ЛЕНИНА В КНИГЕ В. Е. ПОСТНИКОВА «ЮЖНОРУССКОЕ КРЕСТЬЯНСКОЕ ХОЗЯЙСТВО»", "537", "546"],
["1", "ПРОШЕНИЯ В. И. УЛЬЯНОВА (ЛЕНИНА) ", "549", "562"],
["1", "Список работ В. И. Ленина, относящихся к 1891—1894 гг., до настоящего времени не разысканных ", "565", "566"],
["1", "Список работ, переведенных В. И. Лениным", "567", "567"],
["2", "ФРИДРИХ ЭНГЕЛЬС", "1", "14"],
["2", "ОБЪЯСНЕНИЕ ЗАКОНА О ШТРАФАХ, ВЗИМАЕМЫХ С РАБОЧИХ НА ФАБРИКАХ И ЗАВОДАХ", "15", "60"],
["2", "ГИМНАЗИЧЕСКИЕ ХОЗЯЙСТВА И ИСПРАВИТЕЛЬНЫЕ ГИМНАЗИИ («Русское Богатство») ", "61", "69"],
["2", "К РАБОЧИМ И РАБОТНИЦАМ ФАБРИКИ ТОРНТОНА", "70", "74"],
["2", "О ЧЕМ ДУМАЮТ НАШИ МИНИСТРЫ?", "75", "80"],
["2", "ПРОЕКТ И ОБЪЯСНЕНИЕ ПРОГРАММЫ СОЦИАЛ-ДЕМОКРАТИЧЕСКОЙ ПАРТИИ", "81", "110"],
["2", "ЦАРСКОМУ ПРАВИТЕЛЬСТВУ", "111", "116"],
...[there are 6382 lines total, at the moment, ending with]...
["54", "Л. Б. КАМЕНЕВУ. 7 ноября", "6", "6"],
["54", "В ЦЕНТРОПЕЧАТЬ, ИЗДАТЕЛЬСКИЕ ОТДЕЛЫ ВСНХ, НКЗЕМ, НКПС, НКПРОД 8 ноября", "7", "7"],
["54", "В. А. СМОЛЬЯНИНОВУ. 9 ноября", "7", "7"],
["54", "В. А. АВАНЕСОВУ. 9 ноября", "7", "8"],
["54", "В. А. СМОЛЬЯНИНОВУ. 9 ноября", "8", "8"],
["54", "НС. УНШЛИХТУ. 9 ноября", "8", "9"],
["54", "Д. Б. РЯЗАНОВУ. 9 ноября", "9", "9"],
["54", "В. М. МИХАЙЛОВУ ДЛЯ ЧЛЕНОВ ПОЛИТБЮРО ЦК РКП(б). 9 ноября", "9", "9"],
["54", "Г. М. КРЖИЖАНОВСКОМУ. 9 ноября", "10", "10"],
["54", "А. Б. ХАЛАТОВУ. 10 ноября", "10", "10"],
["54", "Н. П. ГОРБУНОВУ. 10 ноября", "10", "11"]];

The problem is that when there's more than one document per page, this doesn't help me much. How do I tell which title goes with which paragraph? Is there a reliable regular expression? When I look at the last page listed, page 10 of volume 54, I see

10<empty-line />ЗАПИСКА П. П. ГОРБУНОВУ<empty-line />И ТЕЛЕГРАММА Л. Б. КРАСИНУ

т. Горбунов! Прошу отправить шифром Красину.

Ваша депеша от 1. XI почти истерична. Вы забыли, что уступить сразу Лесли Уркарту даже вы не предлагали, а решение Политбюро очень обдуманное и не есть<empty-line />* Имеется в виду 7 и 8 ноября. Ред.<empty-line />6<empty-line />В. И. ЛЕНИН

отказ. Относительно же «Фаундэйшн Компани» вам 29. X послано полное согласие и поручение спешить. Надо позаботиться о более быстром обмене телеграмм между нами: аппарат Внешторга вообще плоховат.

Ленин


Написано 7 ноября 1921 г.<empty-line />Послано в Лондон

Впервые напечатано в 1959 г.<empty-line />в Ленинском сборнике XXXVI

Печатается по рукописи

11<empty-line />...

Meanwhile, line 24301 of the file says

20. Г. М. КРЖИЖАНОВСКОМУ. 9 ноября 10<empty-line />
21. А. Б. ХАЛАТОВУ. 10 ноября 10<empty-line />
22. Н. П. ГОРБУНОВУ. 10 ноября 10-11<empty-line />
23. И. И. РАДЧЕНКО. 10 ноября 11

and I suppose that's the relevant part? But I don't know how to work with this.
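
One possible way to work with those lines, as a rough sketch: pull the item number, title, and page range out of each one with a regular expression. This assumes every entry has the shape "number. title page" or "number. title page-page", optionally followed by an <empty-line /> tag, which I have only checked against the four lines above. parseTocLine, entriesOnPage, and the field names are hypothetical.

```javascript
// number, dot, title (lazy), start page, optional "-end page", optional <empty-line /> tag.
const TOC_LINE = /^(\d+)\.\s+(.*?)\s+(\d+)(?:\s*[-\u2013\u2014]\s*(\d+))?\s*(?:<empty-line\s*\/>)?\s*$/;

// Parse one in-volume TOC line; returns null if the line does not fit the pattern.
function parseTocLine(line) {
  const m = TOC_LINE.exec(line);
  if (m === null) return null;
  const startPage = Number(m[3]);
  return {
    number: Number(m[1]),
    title: m[2],
    startPage: startPage,
    // A single page number means the document starts and ends on that page.
    endPage: m[4] === undefined ? startPage : Number(m[4]),
  };
}

// All parsed entries whose page range covers the given page, in their original order.
function entriesOnPage(entries, page) {
  return entries.filter((e) => e.startPage <= page && page <= e.endPage);
}
```

The item number gives the order of the documents within the volume, so for a page shared by several documents it at least says which title should come first. Whether that is enough to attach each title to the right paragraphs is still an open question.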

Also, I'm still working on "lemmaMode", which, as I was explaining, is trickier than I had originally realized. The basic queries are findPhrase (which covers a single word, and finding several or all of the places it appears) and findWithin(A,B,N) for finding places where A and B occur within N words of each other. I suppose A and B could be phrases rather than words, but I've only written code for them being words, in fact lemmas. Let's say we have

matchLiteralToLoc(W,loc) returns TRUE iff wordList[loc]==W
matchLiteralListToLoc([W0,W1,...Wn],loc) returns TRUE iff for each 0<=i<=n matchLiteralToLoc(Wi,loc+i) is TRUE
getLemma(W) returns the lemma of W, or W itself if not found
matchLemmaToLoc(lemma,loc) returns TRUE iff getLemma(wordList[loc])==lemma
findFirstLocOfLemma(L) returns the smallest N such that matchLemmaToLoc(L,N) is TRUE, or -1 on failure
findFirstLocOfLiteral(W) returns the smallest N such that matchLiteralToLoc(W,N) is TRUE, or -1 on failure
findFirstLocOfLiteralList([W0,...Wk]) returns the smallest N such that matchLiteralListToLoc([W0,...Wk],N) is TRUE, or -1 on failure
matchLemmaListToLoc([L0,L1,...Ln],loc) similarly returns TRUE iff for each 0<=i<=n getLemma(wordList[loc+i])==Li
findFirstLocOfLemmaList([L0,...Lk]) returns the smallest N such that matchLemmaListToLoc([L0,...Lk],N) is TRUE, or -1 on failure
findNextLocOfLemma(L,loc) fails if L!=getLemma(wordList[loc]); otherwise finds the smallest N > loc such that matchLemmaToLoc(L,N) is TRUE, or -1 on failure
findNextLocOfLemmaList([L0,...Lk],loc) finds the smallest N > loc such that matchLemmaListToLoc([L0,...Lk],N) is TRUE, or -1 on failure
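
To make the definitions above concrete, here is a minimal sketch of the lemma side in JavaScript, plus one possible findWithin built on top of it. The five-word wordList and the lemmaTable are invented sample data, and scanLemmaList and findAllLocsOfLemma are helper names I made up. I am also assuming that "fails" for findNextLocOfLemma means returning -1, and that findWithin returns pairs of locations. The literal versions would look the same with wordList[loc]===W in place of the getLemma comparison.

```javascript
// Hypothetical sample data; the real wordList and lemma table are much larger.
const wordList = ["рабочие", "и", "крестьяне", "поддержали", "рабочих"];
const lemmaTable = new Map([
  ["рабочие", "рабочий"],
  ["рабочих", "рабочий"],
  ["крестьяне", "крестьянин"],
]);

// The lemma of w, or w itself if it is not in the table.
function getLemma(w) {
  return lemmaTable.has(w) ? lemmaTable.get(w) : w;
}

function matchLemmaToLoc(lemma, loc) {
  return loc >= 0 && loc < wordList.length && getLemma(wordList[loc]) === lemma;
}

function matchLemmaListToLoc(lemmas, loc) {
  return lemmas.every((lemma, i) => matchLemmaToLoc(lemma, loc + i));
}

// Smallest n >= start at which the whole lemma list matches, or -1.
function scanLemmaList(lemmas, start) {
  for (let n = Math.max(start, 0); n + lemmas.length <= wordList.length; n++) {
    if (matchLemmaListToLoc(lemmas, n)) return n;
  }
  return -1;
}

function findFirstLocOfLemmaList(lemmas) { return scanLemmaList(lemmas, 0); }
function findNextLocOfLemmaList(lemmas, loc) { return scanLemmaList(lemmas, loc + 1); }
function findFirstLocOfLemma(lemma) { return scanLemmaList([lemma], 0); }

// Assumption: "fails" means -1 when loc is not itself a match for the lemma.
function findNextLocOfLemma(lemma, loc) {
  if (!matchLemmaToLoc(lemma, loc)) return -1;
  return scanLemmaList([lemma], loc + 1);
}

// Every location of a lemma, by chaining findFirst and findNext.
function findAllLocsOfLemma(lemma) {
  const locs = [];
  for (let n = findFirstLocOfLemma(lemma); n !== -1; n = findNextLocOfLemma(lemma, n)) {
    locs.push(n);
  }
  return locs;
}

// Pairs [locA, locB] of distinct locations at most n words apart.
function findWithin(a, b, n) {
  const pairs = [];
  for (const la of findAllLocsOfLemma(a)) {
    for (const lb of findAllLocsOfLemma(b)) {
      if (la !== lb && Math.abs(la - lb) <= n) pairs.push([la, lb]);
    }
  }
  return pairs;
}
```

With this sample data, findFirstLocOfLemma("рабочий") is 0, findNextLocOfLemma("рабочий", 0) is 4, and findWithin("крестьянин", "рабочий", 2) gives [[2, 0], [2, 4]]. The nested loop in findWithin compares every pair of locations, which is fine for a sketch but would likely need a merge over two sorted location lists for frequent words.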

Um....thinking. :-)

Saturday, March 26, 2016

TOC and lemma+literal matching of words and phrases
