LeninConc: July 2015

I may have missed some earlier discussions and suggestions, but let me ask a question about data structures. Assuming that we start with an XML format (fb2), we have the following stages of the process:

1. For each volume, extract the "Lenin-only" content of the volume.
2. Merge the contents of all volumes.
3. Produce a conc from the merged file.

My question is: what data structures do we use in the Lenin-only content? JSON records that include volume, title of work, page number? perhaps paragraph breaks? Perhaps number all paragraphs consecutively?

LeninConc

Thursday, July 2, 2015

from XML to Lenin-only to conc