rug-compling / alpinocorpus

Library for handling Alpino corpora

License: GNU Lesser General Public License v2.1

C++ 90.23% C 2.45% XSLT 1.20% CMake 0.08% Ragel 1.35% Python 1.35% Nix 0.90% Meson 2.43%
alpino treebank berkeley-db-xml xml-treebank parse lassy

alpinocorpus's People

Contributors

danieldk, jelmervdl, larsmans, pebbe

Forkers

evdmade01

alpinocorpus's Issues

Provide an interface for discovering corpora

For non-local corpora, there is currently no interface for discovering which corpora are available. For RemoteCorpusReader, this is done in Dact. It would be better to have a general interface for remote corpora.

Linking fails

Output of cmake .:

-- Boost version: 1.62.0
-- Found the following Boost libraries:
--   system
--   chrono
--   date_time
--   filesystem
--   thread
--   regex
--   atomic

Output of make:

[ 49%] Linking CXX shared library libalpino_corpus.so
/usr/bin/ld: /usr/lib/gcc/x86_64-linux-gnu/6/../../../x86_64-linux-gnu/libboost_filesystem.a(operations.o): relocation R_X86_64_PC32 against symbol `_ZN5boost6detail15sp_counted_base7destroyEv' can not be used when making a shared object; recompile with -fPIC
/usr/bin/ld: final link failed: Bad value

Leak in iterator

I have a small C program that reads all entries (name and contents) from a set of dact files. Its memory use keeps growing steadily while it runs. After processing almost 100000 entries (in ten dact files), top reports the following (on machine urd):

  PID USER      VIRT  RES  SHR SWAP %MEM %CPU   TIME COMMAND
23667 alfa      794m 704m  10m  89m 35.0   20   1:08 dctest

Here is the program:

#include <stdio.h>
#include <AlpinoCorpus/capi.h>

int main (int argc, char *argv [])
{
    int
        i;
    long
        count = 0;
    alpinocorpus_reader
        r;
    alpinocorpus_iter
        it;
    alpinocorpus_entry
        ent;

    alpinocorpus_initialize();
    for (i = 1; i < argc; i++) {
        r = alpinocorpus_open(argv[i]);
        it = alpinocorpus_entry_iter(r);
        while (alpinocorpus_iter_has_next(r, it)) {
            ent = alpinocorpus_iter_next(r, it);
            printf("%li %s\n", ++count, alpinocorpus_entry_name(ent));
            alpinocorpus_entry_contents(ent);
            alpinocorpus_entry_free(ent);
        }
        alpinocorpus_close(r);
    }

    return 0;
}

Another bug: if I swap the two includes in the program above, the compiler reports an error:

In file included from dctest.c:1:
/my/opt/alpino/include/AlpinoCorpus/capi.h:63: error: expected declaration specifiers or ‘...’ before ‘size_t’
/my/opt/alpino/include/AlpinoCorpus/capi.h:127: error: expected declaration specifiers or ‘...’ before ‘size_t’
/my/opt/alpino/include/AlpinoCorpus/capi.h:146: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘alpinocorpus_size’
make: *** [dctest] Fout 1

Consider removing remote corpora support

This use case is now covered by PaQu. I also think that there is no remaining server running, so this only potentially confuses library users.

Since the classes are in git, we can always choose to revive it when necessary.

Lazy opening of corpora

For the DirectoryCorpusReader this would be implementable.

Currently, Dact tries to open each argument supplied through ac::CorpusReaderFactory::open and adds each CorpusReader instance to a MultiCorpusReader.

One way to implement lazy opening of corpora would be to extend the MultiCorpusReader with an additional constructor. One could supply the paths instead of the corpora and the reader would open and close the corpora when needed. This would change the public interface of MultiCorpusReader but is probably the easiest way to implement it.

Another method might be to add methods to the CorpusReader interface to signal that a reader is no longer actively used. The MultiCorpusReader could then signal the previous CorpusReader that it won't query it for some time, and start querying the next one. The public interface would remain the same, but the behavior of CorpusReader instances would change slightly. Corpora would no longer be opened as soon as an instance of CorpusReader is created, and as a result the error checking and even the try-catch statements around ac::CorpusReaderFactory::open would no longer serve their purpose.
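A minimal sketch of the first approach, under stated assumptions: `Reader` and `LazyMultiReader` are hypothetical names, not alpinocorpus classes, and the `Opener` callback merely stands in for ac::CorpusReaderFactory::open. The paths are stored up front, and each corpus is opened only when iteration reaches it and closed again before the next one is opened.

```cpp
#include <cstddef>
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a corpus reader; not the alpinocorpus class.
struct Reader {
    std::string path;
    std::vector<std::string> entries;
};

// Hypothetical lazy multi-reader: stores paths, opens one corpus at a time.
class LazyMultiReader {
public:
    using Opener = std::function<std::unique_ptr<Reader>(std::string const &)>;

    LazyMultiReader(std::vector<std::string> paths, Opener opener)
        : paths_(std::move(paths)), opener_(std::move(opener)) {}

    // Visit every entry of every corpus, in the order the paths were given.
    void forEachEntry(std::function<void(std::string const &)> const &visit) {
        for (std::string const &path : paths_) {
            // Opened here, not in the constructor, so open errors surface here.
            std::unique_ptr<Reader> reader = opener_(path);
            ++opened_;
            for (std::string const &entry : reader->entries)
                visit(entry);
            // reader is destroyed at the end of this iteration, which closes
            // the corpus before the next one is opened.
        }
    }

    // Number of corpora opened so far.
    std::size_t opened() const { return opened_; }

private:
    std::vector<std::string> paths_;
    Opener opener_;
    std::size_t opened_ = 0;
};
```

Note that with this design an unreadable path is only detected during iteration, which is the same trade-off described above for the second method.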

Constructor of iterator in RemoteCorpusReader waits for data

The constructor of the iterator in RemoteCorpusReader blocks as long as no data has been received. This is a problem when the server needs a long time to find the first match, because the server does not send headers until there is data to send.

The current workaround: the server immediately sends a line consisting of Ctrl-B, and the GetUrl class ignores that line. This is not elegant.

Alternatives:

  1. The GetUrl constructor does not open the web address, but leaves this to the first call of GetUrl.line() or GetUrl.body(), or of any other method that needs the received data.
  2. The constructor of the iterator in RemoteCorpus does not call GetUrl, but leaves this to the first next() or get().
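The first alternative could look roughly like the sketch below. `LazyGetUrl` is a hypothetical class, not the real GetUrl, and the `Fetch` callback is an assumption that stands in for the blocking network request. The constructor only stores its arguments; the request is made on the first call to line().

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch; not the real GetUrl class.
class LazyGetUrl {
public:
    // Stands in for the blocking request; returns the received lines.
    using Fetch = std::function<std::vector<std::string>(std::string const &)>;

    // Returns immediately: no connection is made here.
    LazyGetUrl(std::string url, Fetch fetch)
        : url_(std::move(url)), fetch_(std::move(fetch)) {}

    bool connected() const { return connected_; }

    // Stores the next line in out; returns false once all lines are consumed.
    bool line(std::string &out) {
        ensureConnected();  // the first call blocks here, not in the constructor
        if (pos_ >= lines_.size())
            return false;
        out = lines_[pos_++];
        return true;
    }

private:
    void ensureConnected() {
        if (!connected_) {
            lines_ = fetch_(url_);
            connected_ = true;
        }
    }

    std::string url_;
    Fetch fetch_;
    std::vector<std::string> lines_;
    std::size_t pos_ = 0;
    bool connected_ = false;
};
```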

Location API documentation

I'll remove API documentation from the gh-pages branch, each diff is huge and tracking history has no added value.

As a substitute, I'll run a cron script somewhere that runs a 'git pull ; doxygen'.

Proposal: queryWithStylesheet

We have long been considering whether to add support for XSL transformations to alpinocorpus. Until now, we decided against it, because it was cleaner to let the library user do transformations. However, this poses a problem for a future RemoteCorpusReader: it should implement the CorpusReader interface (and nothing more), but we cannot realistically expect a client to download every XML file matching a query to apply transformations (e.g. in the sentence widget of Dact).

We cannot just implement transformations in the server, because then only RemoteCorpusReader would provide this functionality.

Summary: we need library/server-side XSL transformations.

My proposal is to add a method to CorpusReader:

EntryIterator CorpusReader::queryWithStylesheet(QueryDialect d, std::string const &q,
    std::string const &stylesheet, std::list<MarkerQuery> const &markerQueries) const;

If this method is called, it will return a normal EntryIterator, with an overloaded version of the ill-fated contents() method. Calling the contents() method would:

  • Retrieve the XML file.
  • Apply the given stylesheet.
  • Return the result of the transformation.

The markerQueries argument can be used to mark nodes using queries (like readMarkQueries). We could have some default behavior, where, if the argument is unspecified, it will use the query specified in the second argument to mark nodes.
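To make the intended behavior of contents() concrete, here is a rough sketch. Everything in it is an assumption for illustration: `StylesheetIter` is a hypothetical class, the corpus is modeled as an in-memory map, and `Transform` is an arbitrary string-to-string function standing in for an XSLT stylesheet.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an XSLT stylesheet: any string-to-string transform.
using Transform = std::function<std::string(std::string const &)>;

// Hypothetical iterator over matching entries; contents() returns the
// transformed document rather than the raw XML.
class StylesheetIter {
public:
    StylesheetIter(std::map<std::string, std::string> const &corpus,
                   std::vector<std::string> matches, Transform transform)
        : corpus_(corpus), matches_(std::move(matches)),
          transform_(std::move(transform)) {}

    bool hasNext() const { return pos_ < matches_.size(); }
    std::string const &name() const { return matches_.at(pos_); }
    void next() { ++pos_; }

    std::string contents() const {
        std::string const &xml = corpus_.at(name());  // retrieve the XML file
        return transform_(xml);  // apply the stylesheet and return the result
    }

private:
    std::map<std::string, std::string> const &corpus_;  // must outlive the iterator
    std::vector<std::string> matches_;
    Transform transform_;
    std::size_t pos_ = 0;
};
```

The caller sees the same iteration interface as before; only the result of contents() differs.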

This change would make it possible to:

  • Perform transformations library/server side, making RemoteCorpusReader more speedy.
  • Potentially solve the problem we have in the statistics window (matching nodes, without the given attribute).
  • If a comparable read method is provided (readWithStylesheet?), eliminate the libxslt1 dependency in Dact.

Any comments?

Refactor CorpusReader

There is far too much publicly visible in CorpusReader, for instance FilterIter and StylesheetIter.

Such implementation details should be made private.

Cannot build both i386 and x86_64 on OS X

I can't seem to get dbxml compiled with multiple -arch arguments. My gcc compiler (i686-apple-darwin10-g++-4.2.1) exits with "g++-4.2: -E, -S, -save-temps and -M options are not allowed with multiple -arch flags" when I add "-arch i386 -arch x86_64" to CXXFLAGS/CFLAGS.

Building it for x86_64 only, which seems to be the default choice of my compiler, works fine. But it did require some modifications to the CMakeLists.txt (http://mirror.ikhoefgeen.nl/CMakeLists.diff).

Caching in RemoteCorpusReader without query

At the time of writing, this is in branch 'RemoteCorpus'.

When RemoteCorpusReader is asked for the list of entries without a query, the process is interrupted, and the request is later repeated, RemoteCorpusReader starts retrieving entries from the server from the beginning again.

Should RemoteCorpus be changed to resume retrieving entries from the point where the previous run was cut off?

Remove libxml2 as a dependency

We currently depend on libxml2, as well as on Xerces-C and XQilla via dbxml. This has some disadvantages:

  • Unnecessary dependencies.
  • Compact and directory-based corpora lack the XQuery/XPath 2.0 support that XQilla could provide.
  • Some queries do not validate in the Dact user interface, but are valid in XQilla.

The nolibxml branch attempts to eliminate the use of libxml2.

DbCorpusReader::contents() not working

DbCorpusReader::contents() returns an empty string.

Simple file to test:

#include <AlpinoCorpus/CorpusReader.hh>
#include <AlpinoCorpus/CorpusReaderFactory.hh>
#include <iostream>

int main(int argc, char *argv[])
{
    if (argc != 2) {
        std::cerr << "Usage: " << argv[0] << " corpusfile.dact" << std::endl;
        return 1;
    }

    alpinocorpus::CorpusReader *r = alpinocorpus::CorpusReaderFactory::open(argv[1]);

    for (alpinocorpus::CorpusReader::EntryIterator i =
             r->query(alpinocorpus::CorpusReader::XPATH, "//node[@root=\"fiets\"]");
         i != r->end(); i++)
        std::cout << *i << "\t" << i.contents(*r) << std::endl;

    return 0;
}

Better error handling in the websockets support

The alpinocorpus library has experimental websockets support (besides a more REST-like remote corpus reader). There is some error handling in that reader. Some things that are currently not dealt with:

  • Interruption: what should we do if the connection is interrupted for a longer time and the websocket is closed? We should probably reopen it.
  • There is currently no interface for error reporting to users (such as Dact), besides throwing exceptions. E.g. it is not possible for Dact to inform the user that the connection was closed, but that attempts are being made to reconnect.
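One possible shape for such a reporting interface is a listener that the client registers with the reader. The sketch below is only an illustration: `ConnectionEvent` and `ConnectionMonitor` are hypothetical names, and nothing like them exists in alpinocorpus as far as this issue describes.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical connection states a client such as Dact may want to display.
enum class ConnectionEvent { Closed, Reconnecting, Reconnected };

// Hypothetical notifier that a websocket reader could own.
class ConnectionMonitor {
public:
    using Listener = std::function<void(ConnectionEvent, std::string const &)>;

    void addListener(Listener listener) {
        listeners_.push_back(std::move(listener));
    }

    // Called by the reader whenever the connection state changes.
    void notify(ConnectionEvent event, std::string const &detail) const {
        for (Listener const &listener : listeners_)
            listener(event, detail);
    }

private:
    std::vector<Listener> listeners_;
};
```

Unlike an exception, a notification does not abort the running query, so the reader can report that it is reconnecting and then carry on.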

Implement bidirectional iteration

In directory and compact corpora, we could support bidirectional iteration.

The obvious solution would be to add operator-- to EntryIterator, and throw an exception in iterators that do not support it.
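A sketch of that idea, with hypothetical names throughout (`IterImpl`, `VectorIter`, `StreamIter` and this simplified `EntryIterator` are illustrations, not the library's classes): the implementation interface gets a prev() that throws by default, and only implementations backed by random-access storage override it.

```cpp
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical implementation interface; names are illustrative only.
struct IterImpl {
    virtual ~IterImpl() {}
    virtual std::string const &current() const = 0;
    virtual void next() = 0;
    // Default: backwards iteration is unsupported.
    virtual void prev() {
        throw std::logic_error("iterator does not support operator--");
    }
};

// Random-access backing store, as in a directory or compact corpus.
struct VectorIter : IterImpl {
    VectorIter(std::vector<std::string> names, std::size_t pos)
        : names_(std::move(names)), pos_(pos) {}
    std::string const &current() const override { return names_.at(pos_); }
    void next() override { ++pos_; }
    void prev() override {
        if (pos_ == 0)
            throw std::out_of_range("already at the first entry");
        --pos_;
    }
    std::vector<std::string> names_;
    std::size_t pos_;
};

// Forward-only source, e.g. a streamed query result.
struct StreamIter : IterImpl {
    explicit StreamIter(std::string only) : only_(std::move(only)) {}
    std::string const &current() const override { return only_; }
    void next() override {}
    std::string only_;
};

// Simplified iterator facade that forwards to the implementation.
class EntryIterator {
public:
    explicit EntryIterator(std::shared_ptr<IterImpl> impl) : impl_(std::move(impl)) {}
    std::string const &operator*() const { return impl_->current(); }
    EntryIterator &operator++() { impl_->next(); return *this; }
    EntryIterator &operator--() { impl_->prev(); return *this; }
private:
    std::shared_ptr<IterImpl> impl_;
};
```

A drawback of this design is that the lack of support is only discovered at run time; callers cannot tell from the type whether operator-- will work.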

No difference between natural_order and numerical_order

I see no difference between using natural_order and numerical_order:

0.xml
1.xml
10.xml
100.xml
1000.xml
1001.xml
1002.xml
1003.xml
1004.xml
1005.xml
1006.xml
1007.xml
1008.xml
1009.xml
101.xml
1010.xml
1011.xml
1012.xml
1013.xml
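For comparison, the sketch below shows one common reading of a natural ordering, in which runs of digits are compared by numeric value, so that 2.xml sorts before 10.xml. This is an assumption about what the option is meant to do, not the library's implementation; `naturalLess` and `sortNaturally` are hypothetical names.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// True if c is an ASCII digit (the cast avoids undefined behavior in isdigit).
static bool isDigit(char c) {
    return std::isdigit(static_cast<unsigned char>(c)) != 0;
}

// Compare two names so that embedded digit runs are ordered by numeric value.
bool naturalLess(std::string const &a, std::string const &b) {
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (isDigit(a[i]) && isDigit(b[j])) {
            // Skip leading zeros, then find the end of each digit run.
            while (i < a.size() && a[i] == '0') ++i;
            while (j < b.size() && b[j] == '0') ++j;
            std::size_t ie = i, je = j;
            while (ie < a.size() && isDigit(a[ie])) ++ie;
            while (je < b.size() && isDigit(b[je])) ++je;
            // Without leading zeros, the longer run is the larger number.
            if (ie - i != je - j)
                return ie - i < je - j;
            int cmp = a.compare(i, ie - i, b, j, je - j);
            if (cmp != 0)
                return cmp < 0;
            i = ie;
            j = je;
        } else {
            if (a[i] != b[j])
                return a[i] < b[j];
            ++i;
            ++j;
        }
    }
    // The name with characters left over sorts last.
    return a.size() - i < b.size() - j;
}

// Sort entry names in natural order.
void sortNaturally(std::vector<std::string> &names) {
    std::sort(names.begin(), names.end(), naturalLess);
}
```

With this comparator the listing above would start 0.xml, 1.xml, 10.xml, 100.xml, 101.xml and only then reach 1000.xml.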

Compilation fails on Debian 11 Linux

(peter) /my/src git clone https://github.com/rug-compling/alpinocorpus
Cloning into 'alpinocorpus'...
remote: Enumerating objects: 5980, done.
remote: Counting objects: 100% (107/107), done.
remote: Compressing objects: 100% (64/64), done.
remote: Total 5980 (delta 49), reused 59 (delta 35), pack-reused 5873
Receiving objects: 100% (5980/5980), 1.34 MiB | 3.05 MiB/s, done.
Resolving deltas: 100% (3245/3245), done.
(peter) /my/src cd /my/src/alpinocorpus
(peter) /my/src/alpinocorpus rm -rf builddir /my/opt/alpinocorpus
(peter) /my/src/alpinocorpus meson builddir -D dbxml_bundle=/my/opt/dbxml-2 --prefix=/my/opt/alpinocorpus
The Meson build system
Version: 0.56.2
Source dir: /my/src/alpinocorpus
Build dir: /my/src/alpinocorpus/builddir
Build type: native build
Project name: alpinocorpus
Project version: 3.0.0
C++ compiler for the host machine: c++ (gcc 10.2.1 "c++ (Debian 10.2.1-6) 10.2.1 20210110")
C++ linker for the host machine: c++ ld.bfd 2.35.2
Host machine cpu family: x86_64
Host machine cpu: x86_64
Found pkg-config: /usr/bin/pkg-config (0.29.2)
Run-time dependency Boost (found: filesystem, system) found: YES 1.74.0 (/usr)
Run-time dependency libexslt found: YES 0.8.20
Run-time dependency libxml-2.0 found: YES 2.9.10
Run-time dependency libxslt found: YES 1.1.34
Run-time dependency zlib found: YES 1.2.11
Library xerces-c found: YES
Library xqilla found: YES
Library dbxml found: YES
Configuring config.h using configuration
Build targets in project: 9

Found ninja-1.10.1 at /usr/bin/ninja
(peter) /my/src/alpinocorpus ninja -C builddir install
ninja: Entering directory `builddir'
[4/65] Compiling C++ object src/libalpinocorpus.so.3.0.0.p/capi.cpp.o
FAILED: src/libalpinocorpus.so.3.0.0.p/capi.cpp.o 
c++ -Isrc/libalpinocorpus.so.3.0.0.p -Isrc -I../src -Iinclude -I../include -I/my/opt/dbxml-2/include -I/usr/include -I/usr/include/libxml2 -fdiagnostics-color=always -pipe -D_FILE_OFFSET_BITS=64 -Wall -Winvalid-pch -Wnon-virtual-dtor -std=c++11 -g -fPIC -DBOOST_FILESYSTEM_DYN_LINK=1 -DBOOST_SYSTEM_DYN_LINK=1 -DBOOST_ALL_NO_LIB -MD -MQ src/libalpinocorpus.so.3.0.0.p/capi.cpp.o -MF src/libalpinocorpus.so.3.0.0.p/capi.cpp.o.d -o src/libalpinocorpus.so.3.0.0.p/capi.cpp.o -c ../src/capi.cpp
In file included from /usr/include/unicode/uenum.h:23,
                 from /usr/include/unicode/ucnv.h:53,
                 from /usr/include/libxml2/libxml/encoding.h:31,
                 from /usr/include/libxml2/libxml/parser.h:810,
                 from /usr/include/libxml2/libxml/globals.h:18,
                 from /usr/include/libxml2/libxml/threads.h:35,
                 from /usr/include/libxml2/libxml/xmlmemory.h:218,
                 from /usr/include/libxml2/libxml/tree.h:1307,
                 from /usr/include/libxslt/xsltInternals.h:16,
                 from ../include/AlpinoCorpus/Stylesheet.hh:9,
                 from ../src/capi.cpp:13:
/usr/include/unicode/localpointer.h:67:1: error: template with C linkage
   67 | template<typename T>
      | ^~~~~~~~
In file included from ../src/capi.cpp:13:
../include/AlpinoCorpus/Stylesheet.hh:8:1: note: ‘extern "C"’ linkage started here
    8 | extern "C" {
      | ^~~~~~~~~~
In file included from /usr/include/unicode/uenum.h:23,
                 from /usr/include/unicode/ucnv.h:53,
                 from /usr/include/libxml2/libxml/encoding.h:31,
                 from /usr/include/libxml2/libxml/parser.h:810,
                 from /usr/include/libxml2/libxml/globals.h:18,
                 from /usr/include/libxml2/libxml/threads.h:35,
                 from /usr/include/libxml2/libxml/xmlmemory.h:218,
                 from /usr/include/libxml2/libxml/tree.h:1307,
                 from /usr/include/libxslt/xsltInternals.h:16,
                 from ../include/AlpinoCorpus/Stylesheet.hh:9,
                 from ../src/capi.cpp:13:
/usr/include/unicode/localpointer.h:190:1: error: template with C linkage
  190 | template<typename T>
      | ^~~~~~~~
In file included from ../src/capi.cpp:13:
../include/AlpinoCorpus/Stylesheet.hh:8:1: note: ‘extern "C"’ linkage started here
    8 | extern "C" {
      | ^~~~~~~~~~
In file included from /usr/include/unicode/uenum.h:23,
                 from /usr/include/unicode/ucnv.h:53,
                 from /usr/include/libxml2/libxml/encoding.h:31,
                 from /usr/include/libxml2/libxml/parser.h:810,
                 from /usr/include/libxml2/libxml/globals.h:18,
                 from /usr/include/libxml2/libxml/threads.h:35,
                 from /usr/include/libxml2/libxml/xmlmemory.h:218,
                 from /usr/include/libxml2/libxml/tree.h:1307,
                 from /usr/include/libxslt/xsltInternals.h:16,
                 from ../include/AlpinoCorpus/Stylesheet.hh:9,
                 from ../src/capi.cpp:13:
/usr/include/unicode/localpointer.h:365:1: error: template with C linkage
  365 | template<typename T>
      | ^~~~~~~~
In file included from ../src/capi.cpp:13:
../include/AlpinoCorpus/Stylesheet.hh:8:1: note: ‘extern "C"’ linkage started here
    8 | extern "C" {
      | ^~~~~~~~~~
In file included from /usr/include/unicode/uenum.h:23,
                 from /usr/include/unicode/ucnv.h:53,
                 from /usr/include/libxml2/libxml/encoding.h:31,
                 from /usr/include/libxml2/libxml/parser.h:810,
                 from /usr/include/libxml2/libxml/globals.h:18,
                 from /usr/include/libxml2/libxml/threads.h:35,
                 from /usr/include/libxml2/libxml/xmlmemory.h:218,
                 from /usr/include/libxml2/libxml/tree.h:1307,
                 from /usr/include/libxslt/xsltInternals.h:16,
                 from ../include/AlpinoCorpus/Stylesheet.hh:9,
                 from ../src/capi.cpp:13:
/usr/include/unicode/ucnv.h:585:1: error: conflicting declaration of C function ‘void icu_67::swap(icu_67::LocalUConverterPointer&, icu_67::LocalUConverterPointer&)’
  585 | U_DEFINE_LOCAL_OPEN_POINTER(LocalUConverterPointer, UConverter, ucnv_close);
      | ^~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/include/unicode/uenum.h:68:1: note: previous declaration ‘void icu_67::swap(icu_67::LocalUEnumerationPointer&, icu_67::LocalUEnumerationPointer&)’
   68 | U_DEFINE_LOCAL_OPEN_POINTER(LocalUEnumerationPointer, UEnumeration, uenum_close);
      | ^~~~~~~~~~~~~~~~~~~~~~~~~~~
[10/65] Compiling C++ object src/libalpinocorpus.so.3.0.0.p/CorpusReader.cpp.o
FAILED: src/libalpinocorpus.so.3.0.0.p/CorpusReader.cpp.o 
c++ -Isrc/libalpinocorpus.so.3.0.0.p -Isrc -I../src -Iinclude -I../include -I/my/opt/dbxml-2/include -I/usr/include -I/usr/include/libxml2 -fdiagnostics-color=always -pipe -D_FILE_OFFSET_BITS=64 -Wall -Winvalid-pch -Wnon-virtual-dtor -std=c++11 -g -fPIC -DBOOST_FILESYSTEM_DYN_LINK=1 -DBOOST_SYSTEM_DYN_LINK=1 -DBOOST_ALL_NO_LIB -MD -MQ src/libalpinocorpus.so.3.0.0.p/CorpusReader.cpp.o -MF src/libalpinocorpus.so.3.0.0.p/CorpusReader.cpp.o.d -o src/libalpinocorpus.so.3.0.0.p/CorpusReader.cpp.o -c ../src/CorpusReader.cpp
In file included from /usr/include/unicode/uenum.h:23,
                 from /usr/include/unicode/ucnv.h:53,
                 from /usr/include/libxml2/libxml/encoding.h:31,
                 from /usr/include/libxml2/libxml/parser.h:810,
                 from /usr/include/libxml2/libxml/globals.h:18,
                 from /usr/include/libxml2/libxml/threads.h:35,
                 from /usr/include/libxml2/libxml/xmlmemory.h:218,
                 from /usr/include/libxml2/libxml/tree.h:1307,
                 from /usr/include/libxslt/xsltInternals.h:16,
                 from ../include/AlpinoCorpus/Stylesheet.hh:9,
                 from ../src/CorpusReader.cpp:17:
/usr/include/unicode/localpointer.h:67:1: error: template with C linkage
   67 | template<typename T>
      | ^~~~~~~~
In file included from ../src/CorpusReader.cpp:17:
../include/AlpinoCorpus/Stylesheet.hh:8:1: note: ‘extern "C"’ linkage started here
    8 | extern "C" {
      | ^~~~~~~~~~
In file included from /usr/include/unicode/uenum.h:23,
                 from /usr/include/unicode/ucnv.h:53,
                 from /usr/include/libxml2/libxml/encoding.h:31,
                 from /usr/include/libxml2/libxml/parser.h:810,
                 from /usr/include/libxml2/libxml/globals.h:18,
                 from /usr/include/libxml2/libxml/threads.h:35,
                 from /usr/include/libxml2/libxml/xmlmemory.h:218,
                 from /usr/include/libxml2/libxml/tree.h:1307,
                 from /usr/include/libxslt/xsltInternals.h:16,
                 from ../include/AlpinoCorpus/Stylesheet.hh:9,
                 from ../src/CorpusReader.cpp:17:
/usr/include/unicode/localpointer.h:190:1: error: template with C linkage
  190 | template<typename T>
      | ^~~~~~~~
In file included from ../src/CorpusReader.cpp:17:
../include/AlpinoCorpus/Stylesheet.hh:8:1: note: ‘extern "C"’ linkage started here
    8 | extern "C" {
      | ^~~~~~~~~~
In file included from /usr/include/unicode/uenum.h:23,
                 from /usr/include/unicode/ucnv.h:53,
                 from /usr/include/libxml2/libxml/encoding.h:31,
                 from /usr/include/libxml2/libxml/parser.h:810,
                 from /usr/include/libxml2/libxml/globals.h:18,
                 from /usr/include/libxml2/libxml/threads.h:35,
                 from /usr/include/libxml2/libxml/xmlmemory.h:218,
                 from /usr/include/libxml2/libxml/tree.h:1307,
                 from /usr/include/libxslt/xsltInternals.h:16,
                 from ../include/AlpinoCorpus/Stylesheet.hh:9,
                 from ../src/CorpusReader.cpp:17:
/usr/include/unicode/localpointer.h:365:1: error: template with C linkage
  365 | template<typename T>
      | ^~~~~~~~
In file included from ../src/CorpusReader.cpp:17:
../include/AlpinoCorpus/Stylesheet.hh:8:1: note: ‘extern "C"’ linkage started here
    8 | extern "C" {
      | ^~~~~~~~~~
In file included from /usr/include/unicode/uenum.h:23,
                 from /usr/include/unicode/ucnv.h:53,
                 from /usr/include/libxml2/libxml/encoding.h:31,
                 from /usr/include/libxml2/libxml/parser.h:810,
                 from /usr/include/libxml2/libxml/globals.h:18,
                 from /usr/include/libxml2/libxml/threads.h:35,
                 from /usr/include/libxml2/libxml/xmlmemory.h:218,
                 from /usr/include/libxml2/libxml/tree.h:1307,
                 from /usr/include/libxslt/xsltInternals.h:16,
                 from ../include/AlpinoCorpus/Stylesheet.hh:9,
                 from ../src/CorpusReader.cpp:17:
/usr/include/unicode/ucnv.h:585:1: error: conflicting declaration of C function ‘void icu_67::swap(icu_67::LocalUConverterPointer&, icu_67::LocalUConverterPointer&)’
  585 | U_DEFINE_LOCAL_OPEN_POINTER(LocalUConverterPointer, UConverter, ucnv_close);
      | ^~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/include/unicode/uenum.h:68:1: note: previous declaration ‘void icu_67::swap(icu_67::LocalUEnumerationPointer&, icu_67::LocalUEnumerationPointer&)’
   68 | U_DEFINE_LOCAL_OPEN_POINTER(LocalUEnumerationPointer, UEnumeration, uenum_close);
      | ^~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /my/opt/dbxml-2/include/xqilla/utils/UTF8Str.hpp:28,
                 from /my/opt/dbxml-2/include/xqilla/xqilla-dom3.hpp:28,
                 from ../src/CorpusReader.cpp:32:
/my/opt/dbxml-2/include/xercesc/util/XMLUTF8Transcoder.hpp: In member function ‘void xercesc_3_0::XMLUTF8Transcoder::checkTrailingBytes(XMLByte, unsigned int, unsigned int) const’:
/my/opt/dbxml-2/include/xercesc/util/XMLUTF8Transcoder.hpp:110:25: warning: narrowing conversion of ‘(XMLByte)toCheck’ from ‘XMLByte’ {aka ‘unsigned char’} to ‘char’ [-Wnarrowing]
  110 |         char byte[2] = {toCheck,0};
      |                         ^~~~~~~
[13/65] Compiling C++ object src/libalpinocorpus.so.3.0.0.p/DbCorpusWriter.cpp.o
ninja: build stopped: subcommand failed.

Implement dtget functionality

alpinocorpus now has alpinocorpus-extract to extract a Dact or compact corpus, but most of the functionality of dtget is missing, such as printing one particular entry to stdout.

Remove readMarkQueries

This functionality could be provided by read(), using an empty list of marker queries by default.
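In C++ this amounts to a default argument. The sketch below uses hypothetical stand-ins (`MarkerQuery`, `SketchReader` and the returned strings are illustrations only; the real read() signature may differ).

```cpp
#include <list>
#include <string>

// Hypothetical stand-in; the real MarkerQuery may have different members.
struct MarkerQuery {
    std::string query;
    std::string attr;
    std::string value;
};

// Hypothetical reader with a single read() entry point.
class SketchReader {
public:
    // With no marker queries given, this behaves like a plain read().
    std::string read(std::string const &entry,
                     std::list<MarkerQuery> const &markerQueries =
                         std::list<MarkerQuery>()) const {
        std::string result = "<contents of " + entry + ">";
        // Placeholder for the marking step: record which markers were applied.
        for (MarkerQuery const &m : markerQueries)
            result += " [" + m.attr + "=" + m.value + " where " + m.query + "]";
        return result;
    }
};
```

Existing callers of read() would keep compiling unchanged, and callers of readMarkQueries would only need to rename the call.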

Importing a stylesheet from a stylesheet does not work when called from a different directory

This works with dtxslt:

dtxslt -s Scripts/plat.xsl Enhanced/wiki-1846

Here, Scripts/plat.xsl imports another stylesheet (alp.basename.xsl), also from the Scripts directory.

Trying the same with alpinocorpus-xslt, I get an error:

alpinocorpus-xslt Scripts/plat.xsl Enhanced/wiki-1846
I/O warning : failed to load external entity "alp.basename.xsl"
compilation error: element import
xsl:import : unable to load alp.basename.xsl

Too bad, since dtxslt does not know about .dact files...
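Judging by the error message, the import appears to be looked up relative to the working directory rather than relative to the importing stylesheet. The sketch below shows the resolution rule one would expect instead. `resolveImport` is a hypothetical helper, not a function of alpinocorpus or libxslt, and it is simplified to '/'-separated paths without URI schemes.

```cpp
#include <string>

// Resolve an href from xsl:import or xsl:include against the location of the
// importing stylesheet instead of the current working directory.
std::string resolveImport(std::string const &stylesheetPath,
                          std::string const &href) {
    if (!href.empty() && href[0] == '/')
        return href;  // absolute href: use as given
    std::string::size_type slash = stylesheetPath.rfind('/');
    if (slash == std::string::npos)
        return href;  // the stylesheet itself is in the working directory
    // Keep the directory part, including the trailing slash.
    return stylesheetPath.substr(0, slash + 1) + href;
}
```

For the example above, the import would then be looked up as Scripts/alp.basename.xsl.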

Cannot cancel opening of corpus

Since opening a file happens at the construction of the object, we cannot tell the object to stop opening a file (or query its progress). For the DirectoryCorpusReader this might be really useful. As far as I understand, it isn't possible to implement this in the DbCorpusReader because dbxml does not support it.

Todo: determine the priority of this feature

The problem:
Cannot cancel the open file operation, which can take a lot of time for big corpora.

Possible solutions:

  1. Kill the thread that opens a corpus when the user presses the cancel button. But this implies an unclean shutdown, possibly corrupting indexes and leaving file handles open.
  2. Keep the thread running, but ignore all its signals. This will allow the thread to stop cleanly, but doesn't free up the resources used while opening the corpus. Also I think this is harder to implement in Dact.
  3. Lazy loading. On construction of the corpus reader, do not start loading yet, but delay this until an iterator is requested. This way we have an object that we can tell to stop (using a close() method or something similar).
  4. Separate open() from the constructor, and add a close() method to the reader which can be called while open() is busy to cancel open() or any other action and invalidate the object.
  5. Some sort of progress indicator which we pass to the constructor. The constructor queries this indicator while opening the corpus. From Dact, we can tell the indicator to set its status to 'stop', which the reader can query, after which it stops reading. But this feels like a hack and leaves us with an uninitialized object.

I think 3 or 4 are the best options. I think I would prefer 4 over 3 because it is less magical.
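Option 4 could be sketched as follows. `CancellableReader` is a hypothetical class, and appending file names to an index is an assumption standing in for the real work of opening a corpus. The constructor does nothing expensive, open() checks a flag between files, and close() sets that flag from another thread.

```cpp
#include <atomic>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical reader illustrating option 4; not the alpinocorpus API.
class CancellableReader {
public:
    // Cheap: nothing is opened yet.
    explicit CancellableReader(std::vector<std::string> files)
        : files_(std::move(files)) {}

    // Indexes the files one by one; returns false if close() interrupted it.
    // Intended to run on a worker thread.
    bool open() {
        for (std::string const &file : files_) {
            if (cancelled_.load())
                return false;  // stop between files, never in the middle of one
            index_.push_back(file);
        }
        opened_ = true;
        return true;
    }

    // Safe to call from another thread while open() is running.
    void close() { cancelled_.store(true); }

    // Only meaningful after open() has returned.
    bool isOpen() const { return opened_; }
    std::size_t indexed() const { return index_.size(); }

private:
    std::vector<std::string> files_;
    std::vector<std::string> index_;
    std::atomic<bool> cancelled_{false};
    bool opened_ = false;
};
```

Because the check happens between files, cancellation is clean but not instantaneous; a single very large file would still be read to the end.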

GetUrl.line() cannot detect end of data

GetUrl.line() stops when GetUrl.interrupt() has been called. The way this is implemented prevents detection of 'end of data' from the server. The server therefore has to finish with a line containing Ctrl-D; otherwise GetUrl.line() keeps trying to read new lines.

Can this be changed so that sending a Ctrl-D is not necessary?
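One way to avoid the sentinel is to let line() report why it returned, so that an interrupt and the end of the stream are distinct outcomes. The sketch below is an illustration only: `LineStatus`, `LineReader` and `readAll` are hypothetical, and a std::istream stands in for the network connection.

```cpp
#include <atomic>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical result type: the caller can tell the three cases apart.
enum class LineStatus { Line, Interrupted, EndOfData };

// Hypothetical reader over any std::istream; not the real GetUrl class.
class LineReader {
public:
    explicit LineReader(std::istream &in) : in_(in) {}

    // May be called from another thread.
    void interrupt() { interrupted_.store(true); }

    LineStatus line(std::string &out) {
        if (interrupted_.load())
            return LineStatus::Interrupted;
        if (std::getline(in_, out))
            return LineStatus::Line;
        return LineStatus::EndOfData;  // stream ended: no sentinel line needed
    }

private:
    std::istream &in_;
    std::atomic<bool> interrupted_{false};
};

// Collect all lines from a buffer, stopping at end of data or on interrupt.
std::vector<std::string> readAll(std::string const &data) {
    std::istringstream in(data);
    LineReader reader(in);
    std::vector<std::string> lines;
    std::string current;
    while (reader.line(current) == LineStatus::Line)
        lines.push_back(current);
    return lines;
}
```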

MultiCorpusReader does not return statistics

When the MultiCorpusReader is used in Dact, the statistics window does not contain any values.
(To test, start Dact from the command line with multiple corpora as arguments.)

edit: I think MultiCorpusReaderPrivate::MultiIter::contents() is missing.

Access xml node value

For the statistics window in Dact, it would be really useful if we could access the value of the matching nodes. For example, I could create the query //node[@pt="ww"]/@root and then ask its iterator both for the filename of the XML file that matched and for the string value of @root of the matched node.

edit: to do this correctly, it might be best to merge the search functionality, which is now divided between XPathMapper in Dact and runQuery for the dbxml corpora, into alpinocorpus. Let's support runQuery for all corpus types.

alpinocorpus won't compile

(ALPINOCORPUS_VERSION "1.2.0")
I get:

[  2%] Building CXX object CMakeFiles/alpino_corpus.dir/src/CompactCorpusWriter.cpp.o
In file included from /home/peter/alpino/alpinocorpus/include/AlpinoCorpus/CorpusReader.hh:10,
                 from /home/peter/alpino/alpinocorpus/include/AlpinoCorpus/CompactCorpusWriter.hh:6,
                 from /home/peter/alpino/alpinocorpus/src/CompactCorpusWriter.cpp:3:
/home/peter/alpino/alpinocorpus/include/AlpinoCorpus/IterImpl.hh:4:33: error: AlpinoCorpus/Entry.hh: Bestand of map bestaat niet
In file included from /home/peter/alpino/alpinocorpus/include/AlpinoCorpus/CorpusReader.hh:10,
                 from /home/peter/alpino/alpinocorpus/include/AlpinoCorpus/CompactCorpusWriter.hh:6,
                 from /home/peter/alpino/alpinocorpus/src/CompactCorpusWriter.cpp:3:
/home/peter/alpino/alpinocorpus/include/AlpinoCorpus/IterImpl.hh:16: error: ‘Entry’ does not name a type
In file included from /home/peter/alpino/alpinocorpus/include/AlpinoCorpus/CompactCorpusWriter.hh:6,
                 from /home/peter/alpino/alpinocorpus/src/CompactCorpusWriter.cpp:3:
/home/peter/alpino/alpinocorpus/include/AlpinoCorpus/CorpusReader.hh:35: error: ‘Entry’ does not name a type
make[2]: *** [CMakeFiles/alpino_corpus.dir/src/CompactCorpusWriter.cpp.o] Fout 1
make[1]: *** [CMakeFiles/alpino_corpus.dir/all] Fout 2
make: *** [all] Fout 2
