I would like to add to your implementation an Importer for pdf files. It would get met

How to add an Importer for pdf files? about dotopds HOT 9 CLOSED

shemanaev commented on May 26, 2024

How to add an Importer for pdf files?

Comments (9)

shemanaev commented on May 26, 2024

The server itself is designed to use index files called inpx and doesn't provide a way to scan the filesystem itself. I haven't tested scenarios other than fb2-ready inpx files, so there might be (and will be, I'm sure 😄) bugs.
But basically you need to:

1. Produce an .inpx file in some way for every root directory (i.e. if you have c:\lib1 and c:\lib2, you'll need two files).
2. If you want info that doesn't fit into the inpx format (cover, annotation), implement IBookParser and register it with BookParsersPool.
3. Import every .inpx with its related root (i.e. dotopds import c:\lib1 lib1.inpx).

I found the .inpx format description only in Russian, so here is a translation

gerritv commented on May 26, 2024

Thank you, that helps me a lot. I have been reading the code and understand more than when I opened the Issue :-)
I can generate the .inpx from my PDF parser, will test that out and then decide what to do next.
I am impressed with the design, it looks very expandable.

gerritv commented on May 26, 2024

I have the pdf scanner added (Utils/PdfParser.cs), I chose to recursively scan the directory and process each pdf rather than creating an intermediate file. I didn't add another parser to Parsers, the generic one there is sufficient as the Class in Utils does all the work, using InpxParser.cs as a template.
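A minimal sketch of the recursive scan described above, not the actual code in Utils/PdfParser.cs. The names PdfScanSketch and FindPdfs are hypothetical; only standard System.IO and LINQ calls are used.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class PdfScanSketch
{
    // Lazily yields every file under root (subdirectories included) whose
    // extension is ".pdf", compared case-insensitively.
    // Note: enumeration throws if a subdirectory is not readable.
    public static IEnumerable<FileInfo> FindPdfs(string root)
    {
        return Directory
            .EnumerateFiles(root, "*", SearchOption.AllDirectories)
            .Where(path => string.Equals(
                Path.GetExtension(path), ".pdf",
                StringComparison.OrdinalIgnoreCase))
            .Select(path => new FileInfo(path));
    }
}
```

Each FileInfo returned could then be handed to whatever PDF metadata reader is in use, one book per file, without an intermediate index file.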

Pondering how to add it to the commands. Would it be better to create another Class in Tasks called PdfScanTask and then a 'pdfscan' command to run it? Much or most of the code in PdfScanCommand.cs would be the same as ImportCommand.cs. I had thought of generalizing ImportTask to make it take an option indicating what to import but that got more complex.

gerritv commented on May 26, 2024

OK, upon further pondering over an espresso, I modified ImportTask and ImportCommand:

- Added a required option, ImportType=inpx or pdf.
- Added code in ImportTask to run one of those two tasks.

Long term, it might be best to add a base class for Parser in Parsers and move the inpx/pdf parsers to that directory?
Now on to testing & debugging
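The ImportType option described above could be dispatched roughly as follows. This is only a sketch under assumptions: the enum, the class ImportDispatchSketch, and the two delegate parameters are hypothetical stand-ins, not the real ImportTask or ImportCommand code.

```csharp
using System;

public enum ImportType { Inpx, Pdf }

public static class ImportDispatchSketch
{
    // Maps the command-line value ("inpx" or "pdf", any case) to the enum.
    public static ImportType ParseImportType(string value)
    {
        switch ((value ?? "").Trim().ToLowerInvariant())
        {
            case "inpx": return ImportType.Inpx;
            case "pdf": return ImportType.Pdf;
            default:
                throw new ArgumentException(
                    "ImportType must be 'inpx' or 'pdf'", nameof(value));
        }
    }

    // Runs exactly one of the two importers; the delegates stand in for
    // the real inpx and pdf import routines (hypothetical).
    public static void Run(ImportType type, string libraryPath,
                           Action<string> importInpx, Action<string> importPdf)
    {
        switch (type)
        {
            case ImportType.Inpx: importInpx(libraryPath); break;
            case ImportType.Pdf: importPdf(libraryPath); break;
            default: throw new ArgumentOutOfRangeException(nameof(type));
        }
    }
}
```

A shared base class or interface for the parsers, as suggested above, would replace the second switch with a lookup from ImportType to a parser instance.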

gerritv commented on May 26, 2024

You can see my code changes so far in https://github.com/gerritv/DotOPDS. Scanning of PDFs is working, but I can't get queries working via Aldiko. I tried forcing all books/PDFs to have Genre other,other but still no joy.
So, my next question is: where can I learn about using Owin and System.Web.Http to create some different web pages for serving pages?

shemanaev commented on May 26, 2024

Hey Gerrit,
A genre should be its id, not a human-readable string. You should pick one from the list.Add("sf_history");-style statements in Genres.cs.
Your Book model will then look like this:

var args = new Book
{
    Authors = new[] { author },
    Genres = new[] { "other" },
    Title = info.Title,
    File = Path.GetFileNameWithoutExtension(fi.FullName),
    Size = (int)fi.Length,
    Ext = "pdf",
    Date = info.CreationDate,
    Language = "en",
    Keywords = info.Keywords.Split(','),
    Archive = "",
};
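One caveat with the Keywords line in the snippet above: calling Split on a null string throws, and "a, b".Split(',') leaves a leading space on the second entry. Whether info.Keywords can actually be null depends on the PDF library, so treat this as an assumption. A small hypothetical helper (KeywordSketch.SplitKeywords is not part of DotOPDS) that guards against both:

```csharp
using System;
using System.Linq;

public static class KeywordSketch
{
    // Splits a comma-separated keyword string, trims each entry,
    // and drops empty ones. Returns an empty array for null/blank input.
    public static string[] SplitKeywords(string raw)
    {
        if (string.IsNullOrWhiteSpace(raw))
            return new string[0];

        return raw
            .Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(k => k.Trim())
            .Where(k => k.Length > 0)
            .ToArray();
    }
}
```

With this in place, the initializer line would read Keywords = KeywordSketch.SplitKeywords(info.Keywords).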

I've also pushed some fixes to master; you should pull them.
There is one problem I haven't figured out yet: LuceneImporter always uses RussianAnalyzer for now, as there is neither language autodetection nor a good way to populate the language on import.

gerritv commented on May 26, 2024

Thank you for those fixes/changes.
I now have things sort of working using FBReader. Aldiko and OPDSViewer don't like whatever is being returned.
I also need to work on the File pathname, as my files can be in a subdirectory of the Library Path. Your solution above strips out the intermediate directories. My initial method was also wrong, as it resulted in the Library Path appearing twice in the download link.
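One way to keep the intermediate directories without repeating the Library Path is to store the file's path relative to the library root. The sketch below is an illustration only: RelativePathSketch and its methods are hypothetical, and how DotOPDS combines File and Ext into a download link is an assumption. It avoids Path.GetRelativePath, which is not available on .NET Framework.

```csharp
using System;
using System.IO;

public static class RelativePathSketch
{
    // Returns fullPath relative to libraryRoot, keeping subdirectories,
    // e.g. root "c:\lib1" and file "c:\lib1\sub\a.pdf" give "sub\a.pdf".
    // Uses an ordinal (case-sensitive) prefix check, so both paths should
    // come from the same scan of the same root.
    public static string RelativeToLibrary(string libraryRoot, string fullPath)
    {
        string root = Path.GetFullPath(libraryRoot)
            .TrimEnd(Path.DirectorySeparatorChar, Path.AltDirectorySeparatorChar)
            + Path.DirectorySeparatorChar;
        string full = Path.GetFullPath(fullPath);

        if (!full.StartsWith(root, StringComparison.Ordinal))
            throw new ArgumentException("File is not under the library root",
                                        nameof(fullPath));

        return full.Substring(root.Length);
    }

    // Same as above, but with the extension removed, for models that
    // store the extension in a separate field.
    public static string RelativeWithoutExtension(string libraryRoot, string fullPath)
    {
        return Path.ChangeExtension(RelativeToLibrary(libraryRoot, fullPath), null);
    }
}
```

In the Book initializer this would replace Path.GetFileNameWithoutExtension(fi.FullName) with RelativePathSketch.RelativeWithoutExtension(libraryPath, fi.FullName), where libraryPath is the root passed to the import command.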

I will close this Issue as I am now well past the original question.
I would, though, appreciate a link, a book, or something else where I can learn about WebApi2/Owin/Nowin in English (or Dutch).

shemanaev commented on May 26, 2024

I learned WebApi 2 from official docs.
Nowin/OWIN is pretty straightforward through Nowin samples and OWIN spec.

> Your solution above strips out the intermediate directories.

Yeah, I don't remember all the .NET APIs, but you get the point 😉

gerritv commented on May 26, 2024

Thx, The Message LifeCycle diagram is a huge help.

Yes, I got it :-) My setup is a bit unusual.
Now trying to figure out how to make some pull requests without feeding you my PDF solution (it relies on DebenuPDFLite, which is a bit of a pain to install but is free). Looking at

git cherry-pick
