LeninConc: from XML to Lenin-only to conc

Thursday, July 2, 2015


I may have missed some earlier discussions and suggestions, but let me ask a question about data structures. Assuming that we start from an XML format (fb2), the process has the following stages:

1. For each volume, extract the "Lenin-only" content of the volume.
2. Merge the contents of all volumes.
3. Produce a conc from the merged file.
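The three stages above can be sketched roughly as follows. This is only an illustration under assumptions, not our agreed design: the function names, the plain-dict record shape, and the `is_lenin` predicate (a caller-supplied test on section titles) are all hypothetical, and it assumes that works correspond to fb2 `section` elements with a `title` child.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

FB2 = "{http://www.gribuser.ru/xml/fictionbook/2.0}"  # FictionBook 2 namespace


def _clean(element):
    """Collapse the text content of an element into one whitespace-normalized string."""
    return " ".join("".join(element.itertext()).split())


def extract_volume(fb2_text, volume, is_lenin=lambda title: True):
    """Stage 1: one record per paragraph of each section accepted by is_lenin.

    is_lenin is a hypothetical predicate; how to tell Lenin's text from
    editorial matter is an open question and is not decided here.
    """
    records = []
    for section in ET.fromstring(fb2_text).iter(FB2 + "section"):
        title_el = section.find(FB2 + "title")
        title = _clean(title_el) if title_el is not None else ""
        if not is_lenin(title):
            continue
        # findall() matches direct children only, so the <p> inside <title> is skipped.
        for p in section.findall(FB2 + "p"):
            text = _clean(p)
            if text:
                records.append({"volume": volume, "title": title, "text": text})
    return records


def merge_volumes(volumes):
    """Stage 2: concatenate per-volume records, numbering paragraphs consecutively."""
    merged = []
    for records in volumes:
        for rec in records:
            merged.append({**rec, "para": len(merged) + 1})
    return merged


def build_conc(records):
    """Stage 3: map each lowercased word to its (paragraph, word position) pairs."""
    conc = defaultdict(list)
    for rec in records:
        words = re.findall(r"\w+", rec["text"].lower())
        for pos, word in enumerate(words, start=1):
            conc[word].append((rec["para"], pos))
    return dict(conc)
```

Keeping the stages as separate functions means the intermediate "Lenin-only" records can be written to disk and inspected before any conc is built from them.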

My question is: what data structures do we use for the Lenin-only content? JSON records that include volume, title of work, and page number? Should they also mark paragraph breaks? Should all paragraphs perhaps be numbered consecutively?
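To make the question concrete, here is one possible shape for such a record, written as one JSON object per line. It is a sketch only: the class name `Paragraph`, its field names, and the helpers `dump_jsonl` and `load_jsonl` are assumptions for illustration, and `page` is optional in case a page number cannot be recovered from the source.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class Paragraph:
    """One paragraph of Lenin-only text (hypothetical record layout)."""
    para: int            # consecutive paragraph number across all volumes
    volume: int          # volume the paragraph comes from
    title: str           # title of the work
    page: Optional[int]  # page number, or None if unknown
    text: str            # the paragraph text itself


def dump_jsonl(paragraphs):
    """Serialize records as JSON Lines, keeping Cyrillic text readable."""
    return "\n".join(json.dumps(asdict(p), ensure_ascii=False) for p in paragraphs)


def load_jsonl(data):
    """Parse JSON Lines back into Paragraph records, skipping blank lines."""
    return [Paragraph(**json.loads(line)) for line in data.splitlines() if line.strip()]
```

With one record per paragraph, paragraph breaks are implicit in the record boundaries, and merging volumes amounts to concatenating the files and renumbering `para`.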

1 comment:

Anonymous, July 14, 2015 at 4:06 PM
At present, I am using JSON records that include work title, year, page number, and word numbers. These records are mostly contained in the occurrence table outlined in http://leninconc.blogspot.com/2015/05/data-structures.html