Sunday, June 28, 2015

Lenin in FB2 format

fb2 (FictionBook) is an XML-based e-book format, described here:

https://en.wikipedia.org/wiki/FictionBook

All of Lenin in fb2 is on our shared Google Drive
and also at this link (it will start downloading a zip archive):

http://vk.com/doc59246014_279951847?hash=335f67b31f178bb636&dl=4b35b954ea662c1f05

It's much easier to work with than MS Word, even without a parser. For instance, the headers of all even-numbered pages can be found by searching for:

</p><empty-line /><p>В. И. ЛЕНИН</p>
as in
</p><empty-line /><p>10</p><empty-line /><p>В. И. ЛЕНИН</p>
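
As a rough sketch of that literal search, assuming the fb2 text has already been read into a Python string and that the markup looks exactly like the snippets above (the name `count_even_headers` and the sample string are made up for illustration):

```python
# Literal running header of even-numbered pages, copied from the snippet above.
EVEN_HEADER = "</p><empty-line /><p>В. И. ЛЕНИН</p>"


def count_even_headers(fb2_text: str) -> int:
    """Count literal occurrences of the even-page running header."""
    return fb2_text.count(EVEN_HEADER)


# Tiny made-up sample; a real run would read the .fb2 file as UTF-8 text.
sample = (
    "<p>text</p><empty-line /><p>10</p><empty-line /><p>В. И. ЛЕНИН</p>"
    "<p>more text</p>"
)
print(count_even_headers(sample))
```

No XML parser is involved here; this is only a plain substring count, so it will miss headers if the real file uses different whitespace inside the tags.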

Odd-numbered pages contain the title of the work, in capitals, as in:
</p><empty-line /><p>11</p><empty-line /><p>НОВЫЕ ХОЗЯЙСТВЕННЫЕ ДВИЖЕНИЯ В КРЕСТЬЯНСКОЙ ЖИЗНИ</p>

A simple regular-expression search can find all page numbers. Schematically, the pattern is:

</p><empty-line /><p>NUMERALS-ONLY</p><empty-line />

For odd page numbers, the next <p> element then contains the title.
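
The page-number search can be sketched with Python's `re` module. This is only an illustration under assumptions: it assumes the headers are marked up exactly as in the snippets above (no attributes on `<p>`, no inline tags inside the header text), and the names `PAGE_RE`, `page_headers` and `odd_page_titles` are hypothetical.

```python
import re

# A numerals-only paragraph between empty lines, followed by the header paragraph.
# Group 1 is the page number, group 2 is the text of the next <p> element.
PAGE_RE = re.compile(
    r"</p><empty-line\s*/><p>(\d+)</p><empty-line\s*/><p>([^<]*)</p>"
)


def page_headers(fb2_text):
    """Return (page_number, header_text) pairs in document order."""
    return [(int(m.group(1)), m.group(2)) for m in PAGE_RE.finditer(fb2_text)]


def odd_page_titles(fb2_text):
    """Keep only odd pages, whose header is the title of the work."""
    return [(n, text) for n, text in page_headers(fb2_text) if n % 2 == 1]


# Made-up sample assembled from the two snippets quoted in the post.
sample = (
    "<p>a</p><empty-line /><p>10</p><empty-line /><p>В. И. ЛЕНИН</p>"
    "<p>b</p><empty-line /><p>11</p><empty-line />"
    "<p>НОВЫЕ ХОЗЯЙСТВЕННЫЕ ДВИЖЕНИЯ В КРЕСТЬЯНСКОЙ ЖИЗНИ</p>"
)
for number, title in odd_page_titles(sample):
    print(number, title)
```

On the real file, a numerals-only paragraph that is not a page number (for example, a numbered list item) would also match, so the results would need a sanity check, such as confirming that the page numbers increase.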

This may be useful.

2 comments:

  1. Is there a well-defined way to determine line structure? As I've said, I was hoping to work with line numbers, but it looks like we have to stop with paragraphs within pages because that's where the XML stops. Well, okay. And then we use character counts, or word counts?

  2. Thanks for the new format, Professor. Dr. Myers, would you mind elaborating? Word counts make sense to me because of the "within n words" query, but why would we need character counts as well? (Also, I apologize if multiple versions of my post have appeared. It was unclear to me whether or not they had been published.)

