Handle file/compressed/directory structures and file-chunking in FileInput classes #51

Open
justb4 opened this issue Aug 8, 2016 · 0 comments
justb4 commented Aug 8, 2016

FileInput and derived classes like StringFileInput can handle lists of files from directory and glob.glob parameters. Still, all file content is read and passed on as a single Packet. In addition, .zip files are handled by a separate, dedicated class, ZipFileInput.

It should be possible to generalize FileInput so that derived classes can read from files regardless of whether those files came from directory structures, glob.glob-expanded file lists or .zip files. Even a mixture of these should be handled. For example, within NLExtract, https://github.com/nlextract/NLExtract/blob/master/bag/src/bagfilereader.py can handle any file structure provided.
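A minimal sketch of what such a generalized file-spec resolver could look like, using only the Python standard library. The function name `iter_files` and the `archive!member` naming are hypothetical and not part of Stetl; zip files nested inside other zip files are not expanded here.

```python
import glob
import os
import zipfile
from typing import BinaryIO, Iterator, Tuple


def iter_files(spec: str) -> Iterator[Tuple[str, BinaryIO]]:
    """Yield (name, open binary file object) for every file behind spec.

    spec may be a directory, a .zip file, a plain file or a glob pattern.
    Hypothetical helper for illustration only.
    """
    if os.path.isdir(spec):
        # Walk the tree and resolve each entry recursively, so .zip files
        # found inside directories are expanded as well.
        for root, dirs, names in os.walk(spec):
            dirs.sort()
            for name in sorted(names):
                yield from iter_files(os.path.join(root, name))
    elif zipfile.is_zipfile(spec):
        with zipfile.ZipFile(spec) as archive:
            for info in archive.infolist():
                if info.is_dir():
                    continue
                with archive.open(info) as member:
                    yield f"{spec}!{info.filename}", member
    elif os.path.isfile(spec):
        with open(spec, "rb") as handle:
            yield spec, handle
    else:
        # Anything else is treated as a glob pattern.
        for path in sorted(glob.glob(spec)):
            if os.path.exists(path):
                yield from iter_files(path)
```

A caller only ever sees a flat stream of open file objects, so a derived class would not need to know whether a file came from a directory, a glob expansion or an archive. Each file object is only valid until the next one is requested.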

A second aspect is file chunking: a FileInput may split up a single file into Packets containing data structures extracted from that file. For example, FileInputs like XmlElementStreamerFileInput and LineStreamerFileInput open/parse a file but pass file-content (lines, parsed elements) in fine-grained chunks on each read(). Currently these classes implement this fully within their read() function, but the generic pattern is that they maintain a "context" for the open/parsed file.
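To illustrate the "context" idea, here is a rough sketch in which the state of the currently open file lives in a small object that survives between read() calls. The class names `LineChunkContext` and `ChunkingInput` are made up for this example and do not reflect the actual Stetl classes; Packets are left out and read() simply returns the chunk, or None when all files are exhausted.

```python
from typing import Iterable, Optional, TextIO


class LineChunkContext:
    """State of one open file, kept between successive read() calls."""

    def __init__(self, handle: TextIO):
        self.handle = handle
        self.line_no = 0

    def next_chunk(self) -> Optional[str]:
        line = self.handle.readline()
        if line == "":
            # readline() returns an empty string only at end of file.
            return None
        self.line_no += 1
        return line.rstrip("\n")


class ChunkingInput:
    """Hypothetical input that hands out one line per read()."""

    def __init__(self, handles: Iterable[TextIO]):
        self._handles = iter(handles)
        self._context: Optional[LineChunkContext] = None

    def read(self) -> Optional[str]:
        while True:
            if self._context is None:
                handle = next(self._handles, None)
                if handle is None:
                    return None  # all files exhausted
                self._context = LineChunkContext(handle)
            chunk = self._context.next_chunk()
            if chunk is not None:
                return chunk
            # Current file is done: close it and move on to the next one.
            self._context.handle.close()
            self._context = None
```

The same read() loop would work for XML elements if the context wrapped a parser instead of a plain file handle.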

So all in all this issue addresses two general aspects:

  • handle any file-specs: directories, maps, globbing, zip files and any mix of these
  • handle fine-grained file-chunking: each invoke()/read() may supply part of a file: a line, an XML element, etc.

See also issue #49 for additional discussion which led to this issue.
The Strategy Design Pattern may be applied (many refs on the web).
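As a sketch of how the Strategy pattern might be applied here: the way a file is chunked becomes a pluggable object, independent of where the file came from. All names below (`ChunkStrategy`, `WholeFile`, `Lines`, `XmlElements`, `read_all`) are hypothetical and only serve the illustration; files are assumed to be UTF-8 for the line strategy.

```python
import xml.etree.ElementTree as ET
from abc import ABC, abstractmethod
from typing import BinaryIO, Iterable, Iterator


class ChunkStrategy(ABC):
    """Decides how one open file is split into chunks."""

    @abstractmethod
    def chunks(self, handle: BinaryIO) -> Iterator[object]:
        ...


class WholeFile(ChunkStrategy):
    def chunks(self, handle):
        # One chunk per file: the complete content.
        yield handle.read()


class Lines(ChunkStrategy):
    def chunks(self, handle):
        for raw in handle:
            yield raw.decode("utf-8").rstrip("\r\n")


class XmlElements(ChunkStrategy):
    def __init__(self, tag: str):
        self.tag = tag

    def chunks(self, handle):
        # Stream-parse and emit each matching element as serialized XML.
        for _event, elem in ET.iterparse(handle, events=("end",)):
            if elem.tag == self.tag:
                yield ET.tostring(elem, encoding="unicode")
                elem.clear()  # free memory held by the finished element


def read_all(handles: Iterable[BinaryIO], strategy: ChunkStrategy) -> Iterator[object]:
    """Apply one chunking strategy to every file in the stream."""
    for handle in handles:
        yield from strategy.chunks(handle)
```

With this split, the file-spec handling and the chunking can vary independently: any source of open files can be combined with any chunking strategy.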

@justb4 justb4 added this to the Version 1.10 milestone Aug 8, 2016
@justb4 justb4 self-assigned this Aug 8, 2016
justb4 added a commit that referenced this issue Sep 4, 2016
Thanks for the Issue 49 correction. Yes agree we should work on #51 as to remove redundant functionality and future bugs.
@justb4 justb4 modified the milestones: Version 1.1.0, Version 1.2.0 Nov 4, 2017
@justb4 justb4 modified the milestones: Version 1.2, Version 1.3 May 3, 2018
@justb4 justb4 modified the milestones: Version 2.1, Version 2.2 Jul 22, 2020