norconex / importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc.). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
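
For orientation, here is a minimal sketch of an importer configuration showing the two phases handlers can hook into; the DebugTagger is one of the handlers discussed in the issues below, and everything else is a placeholder:

<importer>
  <preParseHandlers>
    <!-- Handlers here operate on the raw document (original HTML, XML, etc.). -->
  </preParseHandlers>
  <postParseHandlers>
    <!-- Handlers here operate on the extracted plain text and metadata. -->
    <tagger class="com.norconex.importer.handler.tagger.impl.DebugTagger"
        logContent="true" />
  </postParseHandlers>
</importer>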

Home Page: http://www.norconex.com/collectors/importer/

License: Apache License 2.0

Java 98.94% Shell 0.06% HTML 0.93% Batchfile 0.06%
extract html java java-library manipulation norconex-importer parse pdf

importer's People

Contributors

a-alamoudi, dependabot[bot], essiembre, kalhomoud, pascaldimassimo, rustyx


importer's Issues

metadata extraction on News Site

Hi there,
I am trying to crawl a news site, but I am unable to extract information from the metadata in its pages. After crawling some 50 pages there is no page metadata in my Solr index, and no evidence of why: all the content is there, along with metadata from the server and the crawler, but nothing from the pages themselves. Could you please point me to some resources or links on extracting that type of data?
An example of the metadata looks like this:

<meta property="og:type" content="article" />

or

<meta property="ps:publishDate" content="2004/01/19" />

or

<meta property="twitter:url" content="....">

thanks
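
A sketch of one possible approach: since these values live in meta tag attributes, a pre-parse DOMTagger can pull them into fields before parsing discards the markup. The toField names below are arbitrary:

<preParseHandlers>
  <tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
    <dom selector="meta[property=og:type]" toField="og_type" extract="attr(content)" />
    <dom selector="meta[property=ps:publishDate]" toField="publish_date" extract="attr(content)" />
  </tagger>
</preParseHandlers>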

Import only pages whose URL matches a regexp

I want the importer to accept only pages whose URL matches a regexp from the config.
I believe the Java class RegexMetadataFilter does that. The question is: which metadata field holds the page's URL, and does it exist at all?
I use the default importer configuration and don't re-map any fields.
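
A sketch of what that could look like, assuming the page URL is carried in the document.reference field (a field name that appears in other issues on this page); the URL pattern itself is made up:

<filter class="com.norconex.importer.handler.filter.impl.RegexMetadataFilter"
    onMatch="include" field="document.reference">
  https?://example\.com/articles/.*
</filter>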

Bug in ReplaceTagger?

Hi,

For a given crawler, I extract/tag a field EXP_NAME+COUNTRY that contains both the name and the country of an author (in the format "firstname other-names lastname [CountryCode]").

Thanks to a ReplaceTagger with a regex, I expected to extract both pieces of information into separate fields: EXP_NAME and EXP_COUNTRY.

I've made an (xml) example crawler file here to demonstrate:
test_norco.txt

Unfortunately, in the case where the country is not there (no "[]"), the crawler generates a field EXP_COUNTRY with an empty string, but no EXP_NAME field!

What seems strange to me is that the simple Java code attached below works, even though it implements the same regexes:
Test.txt

Am I mistaken somewhere (it's Friday, I might have overlooked something :) ) or is there a bug in ReplaceTagger?
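
For context, since the attached files are not reproduced here, a sketch of the kind of ReplaceTagger configuration described; the regexes are illustrative guesses, not the reporter's actual ones:

<tagger class="com.norconex.importer.handler.tagger.impl.ReplaceTagger">
  <!-- Pull "[CountryCode]" out into its own field. -->
  <replace fromField="EXP_NAME+COUNTRY" toField="EXP_COUNTRY" regex="true">
    <fromValue>^.*\[(.+)\]\s*$</fromValue>
    <toValue>$1</toValue>
  </replace>
  <!-- Keep everything before the optional "[CountryCode]" as the name. -->
  <replace fromField="EXP_NAME+COUNTRY" toField="EXP_NAME" regex="true">
    <fromValue>^(.*?)\s*(?:\[.+\])?\s*$</fromValue>
    <toValue>$1</toValue>
  </replace>
</tagger>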

[DOMSplitter] StackOverflow with norconex-importer 2.5.2

with the following configuration (crawling depth 0):

<importer>
  <preParseHandlers>
    <splitter class="com.norconex.importer.handler.splitter.impl.DOMSplitter"
        selector=".caption" sourceCharset="UTF-8" />
[...]
</importer>

I get a StackOverflowError:

java.lang.StackOverflowError
at java.io.UnixFileSystem.getBooleanAttributes0(Native Method)
at java.io.UnixFileSystem.getBooleanAttributes(UnixFileSystem.java:242)
at java.io.File.exists(File.java:819)
at sun.misc.FileURLMapper.exists(FileURLMapper.java:78)
at sun.misc.URLClassPath$JarLoader.getJarFile(URLClassPath.java:890)
at sun.misc.URLClassPath$JarLoader.access$700(URLClassPath.java:756)
at sun.misc.URLClassPath$JarLoader$1.run(URLClassPath.java:838)
at sun.misc.URLClassPath$JarLoader$1.run(URLClassPath.java:831)
at java.security.AccessController.doPrivileged(Native Method)
at sun.misc.URLClassPath$JarLoader.ensureOpen(URLClassPath.java:830)
at sun.misc.URLClassPath$JarLoader.<init>(URLClassPath.java:803)
at sun.misc.URLClassPath$JarLoader$3.run(URLClassPath.java:1057)
at sun.misc.URLClassPath$JarLoader$3.run(URLClassPath.java:1054)
at java.security.AccessController.doPrivileged(Native Method)
at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:1053)
at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:1013)
at sun.misc.URLClassPath$JarLoader.findResource(URLClassPath.java:983)
at sun.misc.URLClassPath$1.next(URLClassPath.java:240)
at sun.misc.URLClassPath$1.hasMoreElements(URLClassPath.java:250)
at java.net.URLClassLoader$3$1.run(URLClassLoader.java:601)
at java.net.URLClassLoader$3$1.run(URLClassLoader.java:599)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader$3.next(URLClassLoader.java:598)
at java.net.URLClassLoader$3.hasMoreElements(URLClassLoader.java:623)
at sun.misc.CompoundEnumeration.next(CompoundEnumeration.java:45)
at sun.misc.CompoundEnumeration.hasMoreElements(CompoundEnumeration.java:54)
at java.util.ServiceLoader$LazyIterator.hasNextService(ServiceLoader.java:354)
at java.util.ServiceLoader$LazyIterator.hasNext(ServiceLoader.java:393)
at java.util.ServiceLoader$1.hasNext(ServiceLoader.java:474)
at javax.xml.parsers.FactoryFinder$1.run(FactoryFinder.java:293)
at java.security.AccessController.doPrivileged(Native Method)
at javax.xml.parsers.FactoryFinder.findServiceProvider(FactoryFinder.java:289)
at javax.xml.parsers.FactoryFinder.find(FactoryFinder.java:267)
at javax.xml.parsers.SAXParserFactory.newInstance(SAXParserFactory.java:127)
at org.apache.tika.detect.XmlRootExtractor.extractRootElement(XmlRootExtractor.java:51)
at org.apache.tika.detect.XmlRootExtractor.extractRootElement(XmlRootExtractor.java:42)
at org.apache.tika.mime.MimeTypes.getMimeType(MimeTypes.java:206)
at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:472)
at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
at com.norconex.importer.doc.ContentTypeDetector.doDetect(ContentTypeDetector.java:111)
at com.norconex.importer.doc.ContentTypeDetector.detect(ContentTypeDetector.java:75)
at com.norconex.importer.Importer.doImportDocument(Importer.java:233)
at com.norconex.importer.Importer.importDocument(Importer.java:195)
at com.norconex.importer.Importer.doImportDocument(Importer.java:280)
at com.norconex.importer.Importer.importDocument(Importer.java:195)
at com.norconex.importer.Importer.doImportDocument(Importer.java:280)
at com.norconex.importer.Importer.importDocument(Importer.java:195)
at com.norconex.importer.Importer.doImportDocument(Importer.java:280)

[...]

FeatureRequest: MergeTagger

Hi,

For complicated reasons, I need a MergeTagger that would take arguments fromField1, fromField2, toField, and separator, and would merge/concatenate, value by value, the contents of fromField1 + fromField2 into toField, optionally separated by the separator.

Example:
EXP_FIRST_NAME=Fabien^|~Albert
EXP_LAST_NAME=Coco^|~Rico
-- separator=" "-->
EXP_NAME=Fabien Coco^|~Albert Rico

I'm currently trying to write my own quick & dirty version, but Norconex Importer would probably benefit from directly including a clean and generic version!

In case the number of values in the fromFields differs, it could raise an exception or accept a default value?

The prototype of the configuration would be:

 <tagger class="com.norconex.importer.handler.tagger.impl.MergeTagger">
      <merge fromField1="sourceFieldName" fromField2="sourceFieldName" toField="targetFieldName" separator="sep">
      </merge>
 </tagger>

HTML elements import

I'm using the latest Norconex HTTP Collector. By default, the importer removes HTML elements and just keeps the body text.
How do I configure it to keep specific HTML elements? For example, I would like the parsing output to include elements with a URL, like the following:

<a class='download-link' href='http://download.redis.io/releases/redis-3.0.1.tar.gz'>

Thus, the href URL value would be found in the same relative position in the text.

Thanks.
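
A sketch of one way to capture such values before parsing strips the markup; note this stores the URL in a metadata field (name arbitrary) rather than inline in the text:

<preParseHandlers>
  <tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
    <dom selector="a.download-link" toField="download_url" extract="attr(href)" />
  </tagger>
</preParseHandlers>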

bug in com.norconex.importer.handler.tagger.impl.DOMTagger#tagApplicableDocument()

Hello,

thank you for doing nice open source software. I found a problem in

com.norconex.importer.handler.tagger.impl.DOMTagger#tagApplicableDocument()

If you configure several dom selectors for a DOMTagger, for example:

        <tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
            <dom selector="div.address-block span" toField="hoho_address_street" overwrite="false" />
            <dom selector="div.address-block" toField="hoho_address" overwrite="false" />
            <dom selector="div.boolean-listing span" toField="hoho_boolean_labels" overwrite="false" />

and one of these selectors does not find anything, then the remaining, not-yet-evaluated selectors are just ignored. The problem is here:

            for (DOMExtractDetails details : extractions) {
                Elements elms = doc.select(details.selector);
                // no elements matching
                if (elms.isEmpty()) {
                    return;
                }
...
...
                if (values.isEmpty()) {
                    return;
                }

instead of "return" you should code "continue"

Kind regards,

Mike

Transforming pdftotext parsed text results in concatenated strings

Transformation of text results in concatenated strings with pdftotext.

The only difference from PDFBox is that the latter has a whitespace followed by CRLF as EOL, while pdftotext doesn't have this whitespace.

However, the transformer should transform the CRLF into whitespace (two characters, actually).

As this is not happening, the output is concatenated.
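
A sketch of a possible workaround with a ReplaceTransformer, assuming fromValue is treated as a regular expression; whether a whitespace-only toValue survives the XML loading untrimmed is worth verifying:

<transformer class="com.norconex.importer.handler.transformer.impl.ReplaceTransformer">
  <replace>
    <fromValue>\r\n</fromValue>
    <toValue>  </toValue> <!-- two spaces; may be trimmed on load -->
  </replace>
</transformer>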

no such function "toUpperCase"

hi there
Sorry to bother you, but I am having an issue with the script below:

           <transformer class="com.norconex.importer.handler.transformer.impl.ScriptTransformer">
                    <script><![CDATA[
                    empresas = metadata.getString('empresas');
                    consorcios_UT = metadata.getString('consorcios_UT');
                    consorcios_UT = consorcios_UT.toUpperCase();
                    metadata.setString('consorcios_UT', consorcios_UT);
                    metadata.setString('empresas', empresas);
            ]]></script>
            </transformer>

After running the script, I am getting an error saying there is no 'toUpperCase()' function defined:

Caused by: <eval>:3 TypeError: null has no such function "toUpperCase"
at jdk.nashorn.internal.runtime.ECMAErrors.error(ECMAErrors.java:57)
at jdk.nashorn.internal.runtime.ECMAErrors.typeError(ECMAErrors.java:213)
at jdk.nashorn.internal.runtime.ECMAErrors.typeError(ECMAErrors.java:185)
at jdk.nashorn.internal.runtime.ECMAErrors.typeError(ECMAErrors.java:172)
at jdk.nashorn.internal.runtime.linker.NashornBottomLinker.linkNull(NashornBottomLinker.java:180)
at jdk.nashorn.internal.runtime.linker.NashornBottomLinker.getGuardedInvocation(NashornBottomLinker.java:66)
at jdk.internal.dynalink.support.CompositeGuardingDynamicLinker.getGuardedInvocation(CompositeGuardingDynamicLinker.java:124)
at jdk.internal.dynalink.support.LinkerServicesImpl.getGuardedInvocation(LinkerServicesImpl.java:154)
at jdk.internal.dynalink.DynamicLinker.relink(DynamicLinker.java:253)
at jdk.nashorn.internal.scripts.Script$^eval_.:program(<eval>:3)
at jdk.nashorn.internal.runtime.ScriptFunctionData.invoke(ScriptFunctionData.java:637)
at jdk.nashorn.internal.runtime.ScriptFunction.invoke(ScriptFunction.java:494)
at jdk.nashorn.internal.runtime.ScriptRuntime.apply(ScriptRuntime.java:393)
at jdk.nashorn.api.scripting.NashornScriptEngine.evalImpl(NashornScriptEngine.java:418)

could you please check that for me?
thanks
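
For what it's worth, Nashorn's "null has no such function" error suggests metadata.getString('consorcios_UT') returned null, i.e., the field was absent on that document. A defensive sketch:

<transformer class="com.norconex.importer.handler.transformer.impl.ScriptTransformer">
    <script><![CDATA[
        var consorcios_UT = metadata.getString('consorcios_UT');
        // Guard against documents where the field was never set:
        if (consorcios_UT != null) {
            metadata.setString('consorcios_UT', consorcios_UT.toUpperCase());
        }
    ]]></script>
</transformer>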

DOMTagger & DOMSplitter and XML file size

Hi,

I've been using DOMTagger and DOMSplitter a lot in my crawlers, as I'm used to this way of simply extracting data from webpages (note: I come from the Heritrix world, using XPaths...).

In the docs: http://www.norconex.com/collectors/importer/latest/apidocs/ I read: "This class constructs a DOM tree from the document content. That DOM tree is loaded entirely into memory. Use this splitter with caution if you know you'll need to parse huge files. It may be preferable to use a stream-based approach if this is a concern."

I have a question about what you mean by "huge". How big can XML files get before this becomes a problem?

Indeed, I recently started crawling files from https://pairbulkdata.uspto.gov/. Inside the "xml" zip file provided on this page, there are XML files listing patents, split by year. For the recent years (>2000), these XML files can grow to 6GB and more.

I assumed this was actually huge, so before running any Norconex collector, I used Perl's xml_split to split these XML files into smaller chunks. I started with chunks of 150MB. When I started my filesystem-collector in a JVM with a max heap of 8GB, I thought this would be reasonable, but the code seemed to "freeze" very rapidly (0 CPU usage, in sleep state, 10GB virtual memory). I then reduced the size of the chunks to 50MB and restarted, but I'm still facing the same problem: the first split is treated (very slowly: in at least a day), then the second split seems frozen (for 4 days now).

Is 50MB actually too big for an XML file to be treated with a combination of DOMTagger and DOMSplitter?

NOTE1: the filesystem-collector is configured to use only one Thread.
NOTE2: I attempted to understand what the processes are doing. Thanks to https://meenakshi02.wordpress.com/2011/02/02/strace-hanging-at-futex/ , I found a few commands to help: ps -efL | grep <process name> and strace -p <process id>. In my case, they returned:

[pid  2642] futex(0x66b5909459d0, FUTEX_WAIT, 2643, NULL <unfinished ...>
[pid  2636] wait4(-1, ^CProcess 2642 detached

So apparently I indeed have only a single child process that is wait(ing) for something and its parent process that is blocked by a futex.

NOTE3: I've also run "iotop" and the processes do not seem to have any IO activity...

As a consequence, I'm totally clueless about what they could be waiting on...

Concatenated first line with certain PDF's

This issue derailed the OOM discussion, so a new issue was created.

I added some logging to the ReplaceTransformer and found out that certain PDFs arrive at the transformer with the first line concatenated (7 times). After that, the content is normal.

Using the command line (pdfbox-app), the content shows the first line normally, once.

I added the file to Dropbox.

Is it possible to keep HTML tags in .cntnt?

Hi Pascal,

I'm doing a little project with the Norconex HTTP Collector, which fetches news pages that have my city in the keywords field of their metadata, from big news websites.

The fetching works well, but the content files generated are raw text with all the links in them, which makes it hard to write a program that grabs the main article. Is there any way to keep the HTML tags in the content file?

I have already read issue #15; sadly, I failed to code a custom HTML parser (I'll keep trying), and ignoring the parsing of HTML files stops RegexMetadataFilter from working.

Thanks!
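
A sketch of one way to preserve markup: capture it into a field with a pre-parse DOMTagger, before parsing strips the tags (field name arbitrary; large pages will produce large field values):

<preParseHandlers>
  <tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
    <dom selector="body" toField="raw_html" extract="outerHtml" />
  </tagger>
</preParseHandlers>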

Can't get DebugTagger to produce any output

I've got the following as my crawler configuration:

<crawlers>
  <crawler id="Wiki Crawler">
    <startURLs stayOnDomain="true">
      <url>http://wiki.linaro.org/</url>
    </startURLs>
    #parse("shared/importer-config.xml")

    <importer>
      <postParseHandlers>
        <tagger class="com.norconex.importer.handler.tagger.impl.RenameTagger">
          <rename fromField="document.contentEncoding" toField="content_encoding" overwrite="true" />
          <rename fromField="document.contentType" toField="content_type" overwrite="true" />
        </tagger>
        <tagger class="com.norconex.importer.handler.tagger.impl.DebugTagger"
            logContent="true" >
        </tagger>
      </postParseHandlers>
    </importer>
  </crawler>
</crawlers>

When I run the code, though, nothing appears in the output from the DebugTagger.

log4j is set to have the main output on DEBUG:

log4j.logger.com.norconex.collector.http=DEBUG
log4j.logger.com.norconex.collector.core=DEBUG
log4j.logger.com.norconex.importer=DEBUG
log4j.logger.com.norconex.committer=DEBUG

The documentation for DebugTagger says the list of fields is optional, so I'm expecting everything to get dumped in the output/log, but I'm not seeing anything.
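
One thing worth checking (a guess): as I recall, DebugTagger also accepts a logLevel attribute, so forcing the dump to a higher level would rule out log4j threshold issues:

<tagger class="com.norconex.importer.handler.tagger.impl.DebugTagger"
    logContent="true" logLevel="INFO" />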

Preprocess question

Hi there
I am trying to crawl a website with several file types, and I have to strip content before and after certain markers. When I hit a file that is not application/html, I get an error. Is it possible to apply the strips to just a single file type? Please,
I already tried splitting the crawler into one for HTML and another for the other types, but the other crawler just gets stuck and does no crawling at all.
Thanks a lot
Angelo
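
A sketch of one way to scope a handler to a single content type, using the restrictTo element importer handlers accept; the strip markers and the transformer shown are made up for illustration:

<transformer class="com.norconex.importer.handler.transformer.impl.StripBetweenTransformer"
    inclusive="true">
  <restrictTo caseSensitive="false" field="document.contentType">
    text/html
  </restrictTo>
  <stripBetween>
    <start>&lt;!-- BEGIN_STRIP --&gt;</start>
    <end>&lt;!-- END_STRIP --&gt;</end>
  </stripBetween>
</transformer>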

Bug in com.norconex.importer.handler.tagger.impl.DOMTagger?

Hi,

I'm using norconex-collector-http-2.5.1 with lib/norconex-importer-2.6.0-SNAPSHOT.jar

In the crawler configuration file below, you'll find that DESCRIPTION and DESCRIPTION-TEST3 have exactly the same definition/selector. However, in the resulting "meta" file, I get different results for these tags:

DESCRIPTION=Plating and Surface Coating^|~Polymertal Ltd was established on the basis of an enterprise within Digispeech Ltd. Polymertal maintains state of the art R&D abilities that are dedicated to constantly develop and improve the products. The vision was to develop a high quality solution to the ongoing problem of electro-magnetic and radio wave interferences between different electronic and communication systems. Through much research and development the understanding was reached that the solution lies in surface plating – resulting in the identification and development of a unique plating process. As a result, Polymertal has developed unique technologies for polymers surface treatment that allow two major abilities\: metal plating on plastic and composites materials, and surfaces treatment of 3D printer products.  ^|~Find the most relevant companies for you^|~Plating and Surface Coating^|~Polymertal Ltd was established on the basis of an enterprise within Digispeech Ltd. Polymertal maintains state of the art R&D abilities that are dedicated to constantly develop and improve the products. The vision was to develop a high quality solution to the ongoing problem of electro-magnetic and radio wave interferences between different electronic and communication systems. Through much research and development the understanding was reached that the solution lies in surface plating – resulting in the identification and development of a unique plating process. As a result, Polymertal has developed unique technologies for polymers surface treatment that allow two major abilities\: metal plating on plastic and composites materials, and surfaces treatment of 3D printer products.  ^|~Plating and Surface Coating^|~Polymertal Ltd was established on the basis of an enterprise within Digispeech Ltd. Polymertal maintains state of the art R&D abilities that are dedicated to constantly develop and improve the products. The vision was to develop a high quality solution to the ongoing problem of electro-magnetic and radio wave interferences between different electronic and communication systems. Through much research and development the understanding was reached that the solution lies in surface plating – resulting in the identification and development of a unique plating process. As a result, Polymertal has developed unique technologies for polymers surface treatment that allow two major abilities\: metal plating on plastic and composites materials, and surfaces treatment of 3D printer products.  ^|~Find the most relevant companies for you^|~Plating and Surface Coating^|~Polymertal Ltd was established on the basis of an enterprise within Digispeech Ltd. Polymertal maintains state of the art R&D abilities that are dedicated to constantly develop and improve the products. The vision was to develop a high quality solution to the ongoing problem of electro-magnetic and radio wave interferences between different electronic and communication systems. Through much research and development the understanding was reached that the solution lies in surface plating – resulting in the identification and development of a unique plating process. As a result, Polymertal has developed unique technologies for polymers surface treatment that allow two major abilities\: metal plating on plastic and composites materials, and surfaces treatment of 3D printer products.  ^|~Plating and Surface Coating^|~Polymertal Ltd was established on the basis of an enterprise within Digispeech Ltd. 
Polymertal maintains state of the art R&D abilities that are dedicated to constantly develop and improve the products. The vision was to develop a high quality solution to the ongoing problem of electro-magnetic and radio wave interferences between different electronic and communication systems. Through much research and development the understanding was reached that the solution lies in surface plating – resulting in the identification and development of a unique plating process. As a result, Polymertal has developed unique technologies for polymers surface treatment that allow two major abilities\: metal plating on plastic and composites materials, and surfaces treatment of 3D printer products.  ^|~Find the most relevant companies for you^|~Plating and Surface Coating^|~Polymertal Ltd was established on the basis of an enterprise within Digispeech Ltd. Polymertal maintains state of the art R&D abilities that are dedicated to constantly develop and improve the products. The vision was to develop a high quality solution to the ongoing problem of electro-magnetic and radio wave interferences between different electronic and communication systems. Through much research and development the understanding was reached that the solution lies in surface plating – resulting in the identification and development of a unique plating process. As a result, Polymertal has developed unique technologies for polymers surface treatment that allow two major abilities\: metal plating on plastic and composites materials, and surfaces treatment of 3D printer products.  ^|~Plating and Surface Coating^|~Polymertal Ltd was established on the basis of an enterprise within Digispeech Ltd. Polymertal maintains state of the art R&D abilities that are dedicated to constantly develop and improve the products. The vision was to develop a high quality solution to the ongoing problem of electro-magnetic and radio wave interferences between different electronic and communication systems. Through much research and development the understanding was reached that the solution lies in surface plating – resulting in the identification and development of a unique plating process. As a result, Polymertal has developed unique technologies for polymers surface treatment that allow two major abilities\: metal plating on plastic and composites materials, and surfaces treatment of 3D printer products.  ^|~Find the most relevant companies for you^|~Plating and Surface Coating^|~Polymertal Ltd was established on the basis of an enterprise within Digispeech Ltd. Polymertal maintains state of the art R&D abilities that are dedicated to constantly develop and improve the products. The vision was to develop a high quality solution to the ongoing problem of electro-magnetic and radio wave interferences between different electronic and communication systems. Through much research and development the understanding was reached that the solution lies in surface plating – resulting in the identification and development of a unique plating process. As a result, Polymertal has developed unique technologies for polymers surface treatment that allow two major abilities\: metal plating on plastic and composites materials, and surfaces treatment of 3D printer products.  
DESCRIPTION-TEST3=Plating and Surface Coating^|~Polymertal Ltd was established on the basis of an enterprise within Digispeech Ltd. Polymertal maintains state of the art R&D abilities that are dedicated to constantly develop and improve the products. The vision was to develop a high quality solution to the ongoing problem of electro-magnetic and radio wave interferences between different electronic and communication systems. Through much research and development the understanding was reached that the solution lies in surface plating – resulting in the identification and development of a unique plating process. As a result, Polymertal has developed unique technologies for polymers surface treatment that allow two major abilities\: metal plating on plastic and composites materials, and surfaces treatment of 3D printer products.  
DESCRIPTION-TEST2=Plating and Surface Coating
DESCRIPTION-TEST1=<div class\="company__short-description" style\="word-break\: keep-all;">\n  Plating and Surface Coating \n</div>^|~<div class\="company__description">\n <p></p>\n <p>Polymertal Ltd was established on the basis of an enterprise within Digispeech Ltd.&nbsp;Polymertal maintains state of the art R&amp;D abilities that are dedicated to constantly develop and improve the&nbsp;products.</p>\n <p></p> \n <p></p>\n <p>The vision was to develop a high quality solution to the ongoing problem of electro-magnetic and radio wave interferences between different electronic and communication systems. Through much research and development the understanding was reached that the solution lies in surface plating &\#x2013; resulting in the identification and development of a unique plating process.</p>\n <p></p> \n <p></p>\n <p>As a result, Polymertal has developed unique technologies for polymers surface treatment that allow two major abilities\: metal plating on plastic and composites materials, and surfaces treatment of 3D printer products.</p>\n <p></p> \n <p></p>\n <p>&nbsp;</p>\n <p></p>\n</div>

Why do I get the values 8 times for DESCRIPTION?

<?xml version="1.0" encoding="UTF-8"?>
<!-- Testing crawler for Test -->

<httpcollector id="Test">

  <!-- Decide where to store generated files. -->
  <progressDir>./test/progress</progressDir>
  <logsDir>./test/logs</logsDir>

  <crawlers>
    <crawler id="Test">

      <robotsTxt ignore="true" />

      <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
        <url>http://finder.startupnationcentral.org/c/polymertal</url>
      </startURLs>


      <!-- === Recommendations: ============================================ -->

      <!-- Specify a crawler default directory where to generate files. -->
      <workDir>./test</workDir>

      <!-- TODO: Use several threads: set to 5??? -->
      <numThreads>1</numThreads>

      <!-- Put a maximum depth to avoid infinite crawling (e.g. calendars). -->
      <maxDepth>0</maxDepth>  <!-- TODO: Set to 2??? -->

      <!-- We know we don't want to crawl the entire site, so ignore sitemap. -->
      <!-- Since 2.3.0: -->
      <sitemapResolverFactory ignore="true" />

      <!-- Be as nice as you can to sites you crawl. -->
      <delay default="5000" />


      <!-- Crawl only companies pages -->
      <referenceFilters>
        <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include">
          https?://finder\.startupnationcentral\.org/c/[a-z0-9_+-]+
        </filter>
      </referenceFilters>

      <!-- Document extraction/manipulation -->
      <importer>

        <preParseHandlers>
          <tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
                  <!--sourceCharset="UTF-8"-->
            <dom selector="div[class~=company__main]>div[class~=company__short-description],   div[class~=company__main]>div[class=company__description]" toField="DESCRIPTION-TEST1"
                 overwrite="true"
                 extract="outerHtml" />
            <dom selector="div[class~=company__main]>div[class~=company__short-description],   div[class~=company__main]>div[class=company__description]" toField="DESCRIPTION-TEST2"
                 overwrite="true"
                 extract="ownText" />
            <dom selector="div[class~=company__main]>div[class~=company__short-description],   div[class~=company__main]>div[class=company__description]" toField="DESCRIPTION-TEST3"
                 overwrite="true"
                 extract="text" />
            <dom selector="div[class~=company__main]>div[class~=company__short-description],   div[class~=company__main]>div[class=company__description]" toField="DESCRIPTION"
                 overwrite="true"
                 extract="text" />
          </tagger>
        </preParseHandlers>
      </importer>

<!-- Basic committer, for the record -->
      <committer class="com.norconex.committer.core.impl.FileSystemCommitter">
        <directory>./test/crawledFiles</directory>
      </committer>
<!-- -->

    </crawler>
  </crawlers>

</httpcollector>

using ScriptTransformer with a tagged field

hi there
I am having an issue with a transformation I am trying to put in place

after I capture some information from the content field like this:

<pattern field="consorcios_UT" >
  ((([U|u](NION|nion)\s[T|t](EMPORAL|emporal))|([C|c](ONSORCIO|onsorcio)))\s)+(([A-Z][\p{L}]*[\p{S}]*)+\s)
</pattern>

I am trying to use this field (consorcios_UT) in a script like this:

<transformer class="com.norconex.importer.handler.transformer.impl.ScriptTransformer">
    <script><![CDATA[
        consorcios_UT = consorcios_UT.replace(/consorcio/g, 'consorcio');
        /*return*/ consorcios_UT;
        ]]></script>
</transformer>

but I get an error saying there is no such field as "consorcios_UT" defined.

My question: how should I process a field that has just been created, in order to normalize its values and avoid case mismatches, or to strip spaces and punctuation?

thanks a lot
angelo
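
For the record, a sketch of how a previously captured field can be read, normalized, and written back inside the script; a bare consorcios_UT variable is not predefined, only the metadata object is:

<transformer class="com.norconex.importer.handler.transformer.impl.ScriptTransformer">
    <script><![CDATA[
        // Fields captured by earlier handlers are read from metadata:
        var value = metadata.getString('consorcios_UT');
        if (value != null) {
            // Normalize: lower-case and collapse runs of whitespace.
            value = value.toLowerCase().replace(/\s+/g, ' ');
            metadata.setString('consorcios_UT', value);
        }
    ]]></script>
</transformer>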

Feature request for com.norconex.importer.handler.tagger.impl.DOMTagger

Hi again,

description of the problem
I'm trying to extract the people from pages like:
http://finder.startupnationcentral.org/c/polymertal
For this purpose, I'm using the following code:

<tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
            <dom selector="h2[class~=section__title]:containsOwn(The Team)+div[class~=section__content]>div[class~=company-team] div[class~=company-team__photo-wrapper]>img" toField="MEMBER-IMAGE"
                 overwrite="true"
                 extract="attr(src)" />
            <dom selector="h2[class~=section__title]:containsOwn(The Team)+div[class~=section__content]>div[class~=company-team]>*[class~=company-team__member]>div[class~=company-team__info]>div[class~=company-team__name]" toField="MEMBER-NAME"
                 overwrite="true"
                 extract="text" />
            <dom selector="h2[class~=section__title]:containsOwn(The Team)+div[class~=section__content] div[class~=company-team]>*[class~=company-team__member]>div[class~=company-team__info]>div[class~=company-team__position]" toField="MEMBER-POSITION"
                 overwrite="true"
                 extract="text" />
</tagger>

Then I developed a committer that reconstructs a list of 3 "Person" objects (each one having 3 fields: name, position, image) based on the lists IMAGE, NAME, and POSITION, which have length 3.
The problem is that on some pages, the IMAGE or POSITION might be missing. As a consequence, the IMAGE or POSITION lists might contain fewer values than the NAME list. In such cases, I'm unable to reconstruct the "Person" objects, since I don't know which value in the NAME list corresponds to which value in the IMAGE and POSITION lists.

feature request
As a consequence, I would love to be able to specify, for some of the DOMTagger entries, that I want them to return an empty string when they do not match the selector. This is the simplest solution I can think of to ensure that I always have the correct number of values in each list.

PS: If you have any idea on how I can better solve this problem, I'm all ears.

BTW, I just realized I'm using "overwrite=true" everywhere and that I still get multiple values, so there might be a bug here :)

Warning users when configuration is incorrect?

Hi,

I just ran the following bogus configuration (I changed RenameTagger to CopyTagger, but forgot to change the internal XML tag from "rename" to "copy").


It would be great to have a warning for such mistakes (using a non-existent config option in a tagger) :)

Unicode (UTF-8) support?

Could the importer support UTF-8 content?
If yes, does it have a special configuration for Unicode support? For example, how can I parse this text?
"این یک متن فارسی برای تست است ، ژاله ، گچ ،تشکر"

feature request - a ShellTagger

So, another tool, scrapy, offers a lot less out of the box - but it does offer a shell you can easily invoke on any URL and explore what selectors etc. may do to it.

A similar feature would be a Nashorn/Script-engine based shell tagger that drops you into something interactive.

TitleGeneratorTagger for non-English text

Was testing usability of output and noticed "Performing monolingual clustering in: English". Documentation said multiple languages were supported. Are they auto-detected?

Is there a way to override Carrot's clustering algorithm? Use custom stopwords?

I doubt the quality will ever be usable for titles, but it could be interesting to add as keywords.

What is XML equivalent to default importer configuration?

http://www.norconex.com/collectors/importer/configuration does a good job of laying out the possibilities, but it doesn't quite explain what the implicit, default configuration for the default Tika-based importer is.

Also tracking Boilerpipe 1.2.0, which is not in Maven, and which blocks some issues with the Tika and Solr integrations, where documents end up having "Skip navigation" and the like as part of the textual content.

Not bad in a bag-of-words model, but not so great for contextual snippet generation.

Feature request for com.norconex.importer.handler.tagger.impl.DOMTagger

Following your suggestion in #26, I'm opening this new ticket to suggest the possibility of adding a "fromField" to the DOMTagger, so that the user could split the original page into pieces with a first DOMTagger (or any other tagger that would permit such a thing) and then use other DOMTaggers to "parse" each piece separately.
This would allow a "recursive" parsing of the page, for instance to extract multiple people inside a single organization.
Clearly, this tree structure of taggers might result in a tree structure for the generated tags.
As for how to represent the resulting tags:

  1. First, to keep the configuration as generic as possible, I'm not sure you should enforce any pattern in the generated tags. Maybe the user deploys this "Divide & Conquer" technique to ease the writing of the crawler or to speed things up, but still wants a "flat" structure of tags?
  2. In case you want to enforce a tree structure in the generated tags, I think that "representing the hierarchical structure safely" depends on whether you have an idea of the final structure beforehand. If so, a naming scheme based on a pre/post/infix traversal of the tree (possibly with numbering of the nodes to ensure the uniqueness of names) would suffice.
    I believe this is the case, as the structure of the configuration file gives you the required information, since the final tag hierarchy is strongly related to the taggers' hierarchy. But I'm not sure how easy it would be to extract this information.

Also a "sub-feature" that would be great is to enable the reuse of a Tagger on various pieces. For instance, the user could split a page in x pieces and run the same Tagger on each piece. I can imagine doing this either by having the "fromField" being a list or by assigning an id to an instance of a DOMTagger and using references to this instance at several places in the configuration file. However, in the latter case, I'm not sure we can guarantee there is not blocking/infinite loop within the Taggers' references.

ReduceConsecutivesTransformer not working on non-HTML content

If I remember correctly, <reduce>\s</reduce> removed all whitespace in an old build of the HTTP Collector (this was fixed). My current setup with the Filesystem Collector is having the same trouble again. Is it possible this is because of the different collector or MIME type?

Launch the Transformer without an Importer

To skip the costly launch of collector-http, I'd like to create a kind of test project and try different transformation strategies there.

I created a class inside the same package as the transformer's, and I have problems calling the transformTextDocument(...) method.

public class MyDocumentTransformer extends AbstractCharStreamTransformer ... 
...
public class TestClass {
...
        MyDocumentTransformer transformer = new MyDocumentTransformer();
        transformer.loadHandlerFromXML(configuration);
        transformer.transformTextDocument(param1, param2, param3, param4, param5);

I'd like to know what to pass as the method's parameters, as the docs only describe the types, not the semantics.
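
A sketch of a tiny harness, assuming the 2.x AbstractCharStreamTransformer signature transformTextDocument(String reference, Reader input, Writer output, ImporterMetadata metadata, boolean parsed); the parameter semantics in the comments are my reading, not official documentation:

import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;

import com.norconex.importer.doc.ImporterMetadata;

public class TestClass {
    public static void main(String[] args) throws Exception {
        MyDocumentTransformer transformer = new MyDocumentTransformer();
        // transformer.loadHandlerFromXML(configuration);

        Reader input = new StringReader("Some   text   to   transform.");
        StringWriter output = new StringWriter();

        // transformTextDocument is protected, so this only compiles because
        // TestClass sits in the same package as MyDocumentTransformer.
        transformer.transformTextDocument(
                "test://doc1",            // document reference (URL, path, ...)
                input,                    // document content, as a char stream
                output,                   // where transformed content is written
                new ImporterMetadata(),   // document metadata fields
                true);                    // whether the doc was already parsed

        System.out.println(output);
    }
}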

Irregular behaviour of TextBetweenTagger

I'm using TextBetweenTagger in order to acquire HTML code from crawled pages. The configuration looks like:

<preParseHandlers>
    <tagger class="com.norconex.importer.handler.tagger.impl.TextBetweenTagger" inclusive="true">
        <textBetween name="rawhtml">
            <start>^.*</start>
            <end>.*$</end>
        </textBetween>
    </tagger>
</preParseHandlers>

However, this puts the crawler into inconsistent behaviour. Upon crawling ~4k webpages, 10-150 of them have their HTML split up into several rawhtml fields, instead of being put into one field, as happens for the majority of the documents.

Which documents get this behaviour is arbitrary from crawl to crawl, and the number of broken documents varies as well.

Any other way you would suggest to acquire the unparsed HTML content?

EDIT: Version 2.0.1 is being used.

How to work only on elements split by a DOMSplitter

Hi,

I'm trying to parse XML files (from https://pairbulkdata.uspto.gov/) in which patents are described.

In order to parse/import these files, I use a DOMSplitter to separate the data for each patent, then a DOMTagger to extract information for each patent.

Unfortunately, more data than those present in the XML file is submitted to my committer. There is some redundancy.

I think the problem comes from the fact that both the whole file and the split parts get parsed/imported.

Since I've noticed the DOMSplitter adds some information to "document.reference", I've tried to add a filter to my DOMTagger so that my crawler treats only the split elements. More specifically, I used:

           <tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
            <filter class="com.norconex.importer.handler.filter.impl.RegexMetadataFilter"
                    onMatch="include" field="document.reference" >
              !html
            </filter>

Unfortunately, this does not work.

Is there any simple way to make a (DOM)Tagger work only on the split elements from a (DOM)Splitter?

NoSuchMethodError and PDDocument.load

I get this exception:

java.lang.NoSuchMethodError: org.apache.pdfbox.pdmodel.PDDocument.load(Ljava/io/InputStream;Ljava/lang/String;Z)Lorg/apache/pdfbox/pdmodel/PDDocument;
at org.apache.tika.parser.pdf.EnhancedPDFParser.parse(EnhancedPDFParser.java:162)
at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:177)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:177)
at com.norconex.importer.parser.impl.AbstractTikaParser$MergeEmbeddedParser.parse(AbstractTikaParser.java:377)
at com.norconex.importer.parser.impl.AbstractTikaParser.parseDocument(AbstractTikaParser.java:162)
at com.norconex.importer.Importer.parseDocument(Importer.java:418)
at com.norconex.importer.Importer.importDocument(Importer.java:314)
at com.norconex.importer.Importer.doImportDocument(Importer.java:267)
at com.norconex.importer.Importer.importDocument(Importer.java:195)
at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:35)
at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:26)
at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:90)
at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:215)
at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:475)
at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:376)
at com.norconex.collector.core.crawler.AbstractCrawler$ProcessReferencesRunnable.run(AbstractCrawler.java:689)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

I'm trying to use Norconex HTTP Collector v2.2.1, which has dependencies on Norconex Importer v2.3.1 and Apache PDFBox v2.0.0-SNAPSHOT.

EnhancedPDFParser line 162 is:

pdfDocument = PDDocument.load(tstream, password, false);//, preserveMemory);

and as far as I can see in PDFBox v2.0.0-SNAPSHOT, the PDDocument class really doesn't have such a method signature.

OutOfMemoryError: GC overhead limit exceeded

What is the best approach to fix this?

Exception in thread "pool-1-thread-4" java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.fontbox.ttf.TTFDataStream.readString(TTFDataStream.java:75)
at org.apache.fontbox.ttf.TTFDataStream.readString(TTFDataStream.java:61)
at org.apache.fontbox.ttf.PostScriptTable.read(PostScriptTable.java:96)
at org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:299)
at org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:159)
at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:135)
at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:96)
at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:130)
at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75)
at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:93)
at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:50)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:802)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:464)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:438)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:149)
at org.apache.pdfbox.text.PDFTextStreamEngine.processPage(PDFTextStreamEngine.java:117)
at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:347)
at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:283)
at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:248)
at org.apache.tika.parser.pdf.EnhancedPDF2XHTML.process(EnhancedPDF2XHTML.java:144)
at org.apache.tika.parser.pdf.EnhancedPDFParser.parse(EnhancedPDFParser.java:172)
at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:117)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:117)
at com.norconex.importer.parser.impl.AbstractTikaParser$MergeEmbeddedParser.parse(AbstractTikaParser.java:374)
at com.norconex.importer.parser.impl.AbstractTikaParser.parseDocument(AbstractTikaParser.java:159)
at com.norconex.importer.Importer.parseDocument(Importer.java:414)
at com.norconex.importer.Importer.importDocument(Importer.java:314)
at com.norconex.importer.Importer.doImportDocument(Importer.java:267)
at com.norconex.importer.Importer.importDocument(Importer.java:195)
at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:35)

An Error Parsing MP4 Files

It seems like a library is missing for MP4 parsing:
Exception in thread "pool-1-thread-1" INFO [FilesystemCrawler] Projects: Re-processing orphan Files (if any)...
java.lang.NoClassDefFoundError: org/aspectj/lang/Signature
at org.apache.tika.parser.mp4.MP4Parser.parse(MP4Parser.java:117)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
at com.norconex.importer.parser.impl.AbstractTikaParser$RecursiveMetadataParser.parse(AbstractTikaParser.java:133)
at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
at org.apache.tika.parser.pkg.PackageParser.parseEntry(PackageParser.java:169)
at org.apache.tika.parser.pkg.PackageParser.parse(PackageParser.java:135)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
at com.norconex.importer.parser.impl.AbstractTikaParser$RecursiveMetadataParser.parse(AbstractTikaParser.java:133)
at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
at org.apache.tika.parser.pkg.PackageParser.parseEntry(PackageParser.java:169)
at org.apache.tika.parser.pkg.PackageParser.parse(PackageParser.java:135)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
at com.norconex.importer.parser.impl.AbstractTikaParser$RecursiveMetadataParser.parse(AbstractTikaParser.java:133)
at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
at org.apache.tika.parser.pkg.CompressorParser.parse(CompressorParser.java:143)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
at com.norconex.importer.parser.impl.AbstractTikaParser$RecursiveMetadataParser.parse(AbstractTikaParser.java:133)
at com.norconex.importer.parser.impl.AbstractTikaParser.parseDocument(AbstractTikaParser.java:99)
at com.norconex.importer.Importer.parseDocument(Importer.java:379)
at com.norconex.importer.Importer.importDocument(Importer.java:266)
at com.norconex.collector.fs.crawler.DocumentProcessor$ImportModuleStep.processDocument(DocumentProcessor.java:146)
at com.norconex.collector.fs.crawler.DocumentProcessor.processURL(DocumentProcessor.java:91)
at com.norconex.collector.fs.crawler.FilesystemCrawler.processNextQueuedFile(FilesystemCrawler.java:384)
at com.norconex.collector.fs.crawler.FilesystemCrawler.processNextFile(FilesystemCrawler.java:310)
at com.norconex.collector.fs.crawler.FilesystemCrawler.access$100(FilesystemCrawler.java:62)
at com.norconex.collector.fs.crawler.FilesystemCrawler$ProcessFilesRunnable.run(FilesystemCrawler.java:545)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: org.aspectj.lang.Signature
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 44 more
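
If it helps: org.aspectj.lang.Signature normally lives in AspectJ's runtime jar, which Tika's MP4 support needs through the mp4parser/isoparser library. A Maven-style sketch; the version shown is a guess and should match your Tika release:

<dependency>
  <groupId>org.aspectj</groupId>
  <artifactId>aspectjrt</artifactId>
  <version>1.8.0</version>
</dependency>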

TikaException: TIKA-198

There are a lot of exceptions of this kind in the log file.

com.norconex.importer.parser.DocumentParserException: org.apache.tika.exception.TikaException: 
TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.OfficeParser@6f8fdac3

...

Caused by: org.apache.poi.poifs.filesystem.NotOLE2FileException: Invalid header signature; 
read 0x615C316674725C7B, expected 0xE11AB1A1E011CFD0 - Your file appears not to be a valid OLE2 document

Is it right to ask here, or should I open an issue somewhere else?

Boilerpipe usage on importer

hi there
I am trying to figure out how to use the Boilerpipe jar file, but I am not able to. Could you please post some basic instructions or share a link?
thanks a lot

ReduceConsecutivesTransformer behavior

My parsed content has a lot of CRLFs I am trying to clean up. Should

<reduce>\r</reduce>
<reduce>\r\n</reduce>
<reduce>\n</reduce>

be working, or is \r\n not supported?

new DOMTagger's "defaultValue not working?

Hi,

I'm trying to use the new feature that assigns a value when no match is found, but I can't seem to get it to work.

I've got a general tagger that extracts the part of the page with the members of an organization:

<tagger>
  ...
  <dom selector="div.founders" toField="ORG_MEMBERS"
       overwrite="true"
       extract="outerHtml" />
  ...
  </tagger>

Then another tagger extracts information for each member:

  <tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger" fromField="ORG_MEMBERS">
     <dom selector="div ul li div h5:last-of-type" toField="EXP_NAME"
          overwrite="true"
          extract="ownText" />
      <dom selector="div ul li div h5:nth-of-type(1) a img" toField="EXP_IMAGE"
          overwrite="true"
          defaultValue="no-image"
          extract="attr(src)" />
    </tagger>

Unfortunately, in the .meta file I get:
EXP_IMAGE=https://d1qb2nb5cznatu.cloudfront.net/users/89122-medium_jpg?1405520924^|~https://d1qb2nb5cznatu.cloudfront.net/users/1512896-medium_jpg?1441125521
...
EXP_NAME=Krishnan Menon^|~Christian Sutardi^|~Marshall Utoyo^|~Srinivas Sista^|~Filippo Lombardi

So there are 5 names, but only 2 images... None of which is "no-image". Did I make a mistake in my code or is there a bug in the DOMTagger?

FYI, I've just updated the libs to the latest ones available on the website, but my code still does not work...

PS: the source page is: http://500.co/startup/fabelio-2/

example of response Processor

hi there
Is it possible for you to provide an example of the response processor? Please, I'm totally lost on this.

thanks a lot

question - itemscope and itemtype

Using the technique in #44, I discovered I didn't need to do anything to extract schema.org metadata, because either Norconex importer or Tika will create metadata for objects within an itemscope.

My question is just whether this is Norconex or Tika. This affects my evaluation for my boss: if Tika, we had it either way; if Norconex, it is another strong win for Norconex over Scrapy, and #44 provides an environment much like the Scrapy shell.

Difference between content type fields

I now have Kibana installed on another host, and it is proving fruitful in allowing me to explore my data. I find it interesting that the KeepOnlyTagger lets multiple casings of a field through, but these are not normalized by Elasticsearch!

This is leading to a good amount of testing, and I wonder:

  • What regexp flags are used on <fromFieldsRegex> in the MergeTagger, and in general?
  • Is there a way (other than ScriptTagger) to remove empty fields?
  • What are the differences between Content-Type, Content-Type-Hint and document.contentType? If I plan to keep one, which would you suggest?

RDBMSQueryTagger and JNDI context during a crawl

Pascal,

I need to code up something like this for my own use. On my side, I will probably just embed a JDBC connect string into an XML attribute or tag, but I would rather have each collector, as part of collector-core, establish a JNDI context using some context.xml or something similar, external to the collector's XML configuration, because the connection factory should probably be shared by multiple RDBMSQueryTaggers.

My specific use-case is that some of the URLs I crawl are already URLs linked to one or more MedlinePlus health topics (e.g. https://medlineplus.gov/bloodsugar.html contains a bunch of links), and we want to find "More Like This" with a multiple document similarity query over what we have crawled - for which purpose I want to tag each document crawled that is already in the database with its topic ids. The RDBMSQueryTagger is a more generic mechanism to accomplish this.

Confusion between the fact that a tag has-no-value and is-set-to-an-empty-value in DOMTagger

Hi,

I've built a minimal test which demonstrates that there is a confusion between a tag/attribute having no value (i.e., it is not present) and it having an empty value (i.e., it is present in the document but with an empty String(*) as a value).

I believe this is a bug, in the same manner as List a = null in Java being different from List a = new ArrayList().

Also, mixed with the "new" defaultValue feature of DOMTagger, this muddles things up. Indeed, since a tag that is present but has a defined value of empty String is considered not present, it gets replaced by the defaultValue, thus losing its original (empty) value.

The provided example files demonstrate the problem: they extract FirstNames, MiddleNames and LastNames separately, then merge First & Middle names into a new FirstName. To ensure we have the same number of values for all attributes of an author, we use the defaultValue option.
If you run this example you get:

EXP_FIRST-NAME-AUX=firstname11^|~firstname12^|~firstname13^|~firstname14^|~firstname21^|~firstname22^|~firstname23^|~firstname24^|~firstname31^|~firstname32^|~firstname33^|~firstname34
EXP_MIDDLE-NAME-AUX=middlename11^|~middlename12^|~middlename13^|~middlename14^|~middlename21^|~NO_MIDDLE-NAME^|~NO_MIDDLE-NAME^|~middlename24^|~NO_MIDDLE-NAME^|~middlename32^|~NO_MIDDLE-NAME^|~middlename34
EXP_FIRST-NAME=firstname11 middlename11^|~firstname12 middlename12^|~firstname13 middlename13^|~firstname14 middlename14^|~firstname21 middlename21^|~firstname22 NO_MIDDLE-NAME^|~firstname23 NO_MIDDLE-NAME^|~firstname24 middlename24^|~firstname31 NO_MIDDLE-NAME^|~firstname32 middlename32^|~firstname33 NO_MIDDLE-NAME^|~firstname34 middlename34

As can be seen, the empty Strings used for middlename22 and middlename23 are replaced by "NO_MIDDLE-NAME", in the same manner as the non-present middlename31 and middlename33, which is not what I would expect.

Also, this is not consistent with what jsoup returns:

$ ./test_csssel_expr.sh crawlers/testDefVal/example2.xml "publication authors author middlename" # I can send you this script if necessary
middlename11
middlename12
middlename13
middlename14
middlename21


middlename24
middlename32
middlename34

As can be seen, jsoup makes a difference between when the tag is there and set to an empty string (empty lines are returned in place of middlename22 and middlename23) and when the tag is not set (no line is returned for middlename31 and middlename33).

By the way, I think this is a general problem in DOMTagger (and maybe other Norconex products?), since this confusion also occurs in the treatment of the "defaultValue" itself.

Indeed, in the same example files, if you replace the defaultValue with an empty string (as in the first commented-out line), then the defaultValue replaces a non-existent value with another non-existent (thus dropped) value, instead of an empty String(*). As a consequence, the defaultValue has absolutely no effect, and we end up with a different number of MiddleNames than of First & LastNames... which is not what I would expect...

(*) Finally, note that since you seem to trim() all the entries in the XML files, the problem does not only occur for empty Strings, but also for any sequence of space characters. E.g., setting the defaultValue to any (possibly empty) sequence of spaces will result in the option having no effect (this is what the 2nd commented-out line demonstrates).

example2.xml.txt
testDefVal2.xml.txt

Importer Fork

Hello,

I plan to create a copy of the existing importer that contains some additional, specific functionality.

It is possible I have misunderstood the importer's configuration options and am creating a duplicate: I need a kind of advanced heuristic for working with the DOM tree of XML/HTML documents (jsoup, or similar). I see it as a class for the http-collector.

Reading the importer sources, I realized they have a complex structure, so I need your advice: which classes may I take as a sample, and how should I organize the project?

Thank you.

Import RSS Feed

Hi
I want to collect pages from an RSS feed.
This is my crawler, but I get no results.
Please help me.

<httpcollector id="IDOL HTTP Collector">

  <!-- Decide where to store generated files. -->
  <progressDir>./examples-output/minimum/progress</progressDir>
  <logsDir>./examples-output/minimum/logs</logsDir>

  <crawlerDefaults>  
    <numThreads>4</numThreads>
    <maxDepth>1</maxDepth>
    <maxDocuments>-1</maxDocuments>
    <keepDownloads>false</keepDownloads>
    <orphansStrategy>IGNORE</orphansStrategy>    
    <referenceFilters>
      <filter class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter" onMatch="exclude" caseSensitive="false" >
        jpg,gif,png,ico,css,js</filter>    
    </referenceFilters>          
    <importer>
        <postParseHandlers>
          <!-- If your target repository does not support arbitrary fields,
               make sure you only keep the fields you need. -->
          <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
            <fields>title,keywords,description,document.reference</fields>
          </tagger>
        </postParseHandlers>
     </importer> 

    <committer class="com.norconex.committer.idol.IdolCommitter">

        <!-- To commit documents to IDOL or DIH: -->
        <databaseName>Webcontent</databaseName>

        <!-- To commit documents to CFS: -->
        <host>127.0.0.1</host>
        <indexPort>9001</indexPort>
        <dreAddDataParams>
            <param name="Job">Norconex Job</param>
        </dreAddDataParams>
    </committer>
  </crawlerDefaults>

  <crawlers>    
    <crawler id="Rss Ou Va Algerie">
    <startURLs>
        <url>http://www.lefigaro.fr/rss/figaro_politique.xml</url>
      </startURLs>
      <workDir>./examples-output/Ou_Va_Algerie</workDir>     
      <sitemap ignore="true" /> 
      <delay default="5000" />    
      <referenceFilters>
        <filter 
            class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
            onMatch="include" >
        http://www.lefigaro.fr/.*
        </filter>
      </referenceFilters>
    </crawler>
  </crawlers>

</httpcollector>
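
A guess at one possible cause: the default link extractor only scans HTML content types, so links inside the RSS XML itself may never be queued. A sketch of a crawler-level override, using the GenericLinkExtractor elements as I recall them from the 2.x docs (a tag entry without an attribute takes the tag's text as the URL):

<linkExtractors>
  <extractor class="com.norconex.collector.http.url.impl.GenericLinkExtractor">
    <contentTypes>application/rss+xml, application/xml, text/xml, text/html</contentTypes>
    <tags>
      <tag name="a" attribute="href" />
      <tag name="link" />
    </tags>
  </extractor>
</linkExtractors>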

file size limit

hi there
I have a question regarding the importer: is it possible to limit the content size of a file? I am having issues with some large MS Excel files, and I would like to index just the first couple of MB instead of the full 45MB.

Could you please point me to some resources or give me some clues on how to deal with such large files?
thanks a lot
best regards
angelo

[DOMSplitter] JSoup issue with norconex-importer 2.5.2

Hi!

I get the following exception when I use the DOMSplitter:

java.lang.NoSuchMethodError: org.jsoup.nodes.Element.cssSelector()Ljava/lang/String;
at com.norconex.importer.handler.splitter.impl.DOMSplitter.splitApplicableDocument(DOMSplitter.java:151)

With configuration:

<importer>
                <preParseHandlers>
                <splitter class="com.norconex.importer.handler.splitter.impl.DOMSplitter"
                        selector=".caption"  sourceCharset="UTF-8"/>
[...]
</importer>
