fb2 (FictionBook) is an XML format, described here:
https://en.wikipedia.org/wiki/FictionBook
All of Lenin is available in fb2 ON OUR SHARED GOOGLE DRIVE
and also at this link (opening it starts the download of a zip archive):
http://vk.com/doc59246014_279951847?hash=335f67b31f178bb636&dl=4b35b954ea662c1f05
It's much easier to work with than MS Word, even without a parser. For instance, the running headers of all even-numbered pages can be found by searching for:
</p><empty-line /><p>В. И. ЛЕНИН</p>
as in
</p><empty-line /><p>10</p><empty-line /><p>В. И. ЛЕНИН</p>
The headers of odd-numbered pages contain the title of the work, in capitals, as in:
</p><empty-line /><p>11</p><empty-line /><p>НОВЫЕ ХОЗЯЙСТВЕННЫЕ ДВИЖЕНИЯ В КРЕСТЬЯНСКОЙ ЖИЗНИ</p>
A simple regular-expression search can find all page numbers. Schematically, the pattern is:
</p><empty-line /><p>NUMERALS-ONLY</p><empty-line />
For odd page numbers, the <p> element that follows contains the title.
This may be useful.
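
The search described above could be sketched in Python roughly as follows. This is only an illustration, not a tested tool: it assumes the markup has exactly the spacing shown in the examples (`<empty-line />` with a single space, no line breaks between tags), and the names `PAGE_RE`, `find_page_headers` and `titles_by_page` are made up for this sketch.

```python
import re

# A page-number paragraph followed by the running-header paragraph.
# Assumption: tags are spaced exactly as in the examples above.
PAGE_RE = re.compile(
    r"</p><empty-line /><p>(\d+)</p><empty-line /><p>([^<]*)</p>"
)

# Header text that appears on even-numbered pages.
AUTHOR_HEADER = "В. И. ЛЕНИН"


def find_page_headers(fb2_text):
    """Return (page_number, header_text) tuples in document order."""
    return [(int(m.group(1)), m.group(2)) for m in PAGE_RE.finditer(fb2_text)]


def titles_by_page(fb2_text):
    """Map each odd page number to the title found in its header."""
    return {
        page: header
        for page, header in find_page_headers(fb2_text)
        if page % 2 == 1 and header != AUTHOR_HEADER
    }


# Small sample built from the two fragments quoted above.
sample = (
    "<p>text</p><empty-line /><p>10</p><empty-line /><p>В. И. ЛЕНИН</p>"
    "<p>more text</p><empty-line /><p>11</p><empty-line />"
    "<p>НОВЫЕ ХОЗЯЙСТВЕННЫЕ ДВИЖЕНИЯ В КРЕСТЬЯНСКОЙ ЖИЗНИ</p>"
)
headers = find_page_headers(sample)
titles = titles_by_page(sample)
print(headers)
print(titles)
```

To run it on a real volume, read the file into a string first, using whatever encoding the file's XML declaration names. If the real files put whitespace or line breaks between tags, the pattern would need `\s*` in those places.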