clearnlp's Issues

Decoding with a new trained model

Hello Jinho,

I managed to train a new POS tagging model, but I'm having trouble decoding with it.

I went as far as renaming my model to "general-en-pos.xz", replacing the original model inside "clearnlp-general-en-pos-3.2.jar" with mine, and building a new jar under the same name, "clearnlp-general-en-pos-3.2.jar". It still gives me the error below.

java.io.IOException: Stream closed
at java.io.BufferedInputStream.getInIfOpen(BufferedInputStream.java:159)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at java.io.DataInputStream.readFully(DataInputStream.java:195)
at java.io.DataInputStream.readFully(DataInputStream.java:169)
at org.tukaani.xz.SingleXZInputStream.initialize(Unknown Source)
at org.tukaani.xz.SingleXZInputStream.<init>(Unknown Source)
at org.tukaani.xz.XZInputStream.<init>(Unknown Source)
at org.tukaani.xz.XZInputStream.<init>(Unknown Source)
at edu.emory.clir.clearnlp.component.utils.NLPUtils.getObjectInputStream(NLPUtils.java:173)
at edu.emory.clir.clearnlp.component.utils.NLPUtils.getPOSTagger(NLPUtils.java:94)
at edu.emory.clir.clearnlp.bin.NLPDecode.getComponents(NLPDecode.java:207)
at edu.emory.clir.clearnlp.bin.NLPDecode.decode(NLPDecode.java:89)
at edu.emory.clir.clearnlp.bin.NLPDecode.<init>(NLPDecode.java:74)
at edu.emory.clir.clearnlp.bin.NLPDecode.main(NLPDecode.java:236)
log4j:WARN No appenders could be found for logger (edu.emory.clir.clearnlp.util.BinUtils).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
java.lang.NullPointerException
at edu.emory.clir.clearnlp.component.AbstractStatisticalComponent.load(AbstractStatisticalComponent.java:157)
at edu.emory.clir.clearnlp.component.AbstractStatisticalComponent.initDecode(AbstractStatisticalComponent.java:126)
at edu.emory.clir.clearnlp.component.AbstractStatisticalComponent.<init>(AbstractStatisticalComponent.java:99)
at edu.emory.clir.clearnlp.component.mode.pos.AbstractPOSTagger.<init>(AbstractPOSTagger.java:60)
at edu.emory.clir.clearnlp.component.mode.pos.EnglishPOSTagger.<init>(EnglishPOSTagger.java:49)
at edu.emory.clir.clearnlp.component.utils.NLPUtils.getPOSTagger(NLPUtils.java:87)
at edu.emory.clir.clearnlp.component.utils.NLPUtils.getPOSTagger(NLPUtils.java:94)
at edu.emory.clir.clearnlp.bin.NLPDecode.getComponents(NLPDecode.java:207)
at edu.emory.clir.clearnlp.bin.NLPDecode.decode(NLPDecode.java:89)
at edu.emory.clir.clearnlp.bin.NLPDecode.<init>(NLPDecode.java:74)
at edu.emory.clir.clearnlp.bin.NLPDecode.main(NLPDecode.java:236)
Exception in thread "main" java.lang.NullPointerException
at edu.emory.clir.clearnlp.component.mode.pos.AbstractPOSTagger.createStringFeatureVector(AbstractPOSTagger.java:120)
at edu.emory.clir.clearnlp.component.mode.pos.AbstractPOSTagger.createStringFeatureVector(AbstractPOSTagger.java:33)
at edu.emory.clir.clearnlp.component.AbstractStatisticalComponent.decode(AbstractStatisticalComponent.java:294)
at edu.emory.clir.clearnlp.component.AbstractStatisticalComponent.process(AbstractStatisticalComponent.java:267)
at edu.emory.clir.clearnlp.component.mode.pos.AbstractPOSTagger.process(AbstractPOSTagger.java:105)
at edu.emory.clir.clearnlp.bin.NLPDecode.process(NLPDecode.java:163)
at edu.emory.clir.clearnlp.bin.NLPDecode.process(NLPDecode.java:153)
at edu.emory.clir.clearnlp.bin.NLPDecode.decode(NLPDecode.java:107)
at edu.emory.clir.clearnlp.bin.NLPDecode.<init>(NLPDecode.java:74)
at edu.emory.clir.clearnlp.bin.NLPDecode.main(NLPDecode.java:236)

Would appreciate help with moving forward.

Thank you!
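
A possible way to narrow this down: the first trace starts in BufferedInputStream.getInIfOpen, which in the JDK reports "Stream closed" when the wrapped stream is null. One explanation, which is only an assumption and not confirmed in this thread, is that the model was not found on the classpath under the path the loader expects, for example because the entry sits in a different directory inside the rebuilt jar. The sketch below checks whether a classpath resource exists and starts with the XZ magic header. The class name ModelResourceCheck and the default path are hypothetical; pass the real location of the model inside the jar.

```java
import java.io.IOException;
import java.io.InputStream;

final class ModelResourceCheck {
    // First six bytes of every .xz file (the XZ format's magic header).
    private static final int[] XZ_MAGIC = {0xFD, '7', 'z', 'X', 'Z', 0x00};

    /** Returns true if the stream starts with the XZ magic header. */
    static boolean looksLikeXz(InputStream in) throws IOException {
        for (int expected : XZ_MAGIC) {
            if (in.read() != expected) return false;
        }
        return true;
    }

    /** Describes whether a classpath resource exists and looks like XZ data. */
    static String describe(String resourcePath) throws IOException {
        ClassLoader loader = ModelResourceCheck.class.getClassLoader();
        try (InputStream in = loader.getResourceAsStream(resourcePath)) {
            if (in == null) return "not found on classpath: " + resourcePath;
            return looksLikeXz(in) ? "found, XZ header OK" : "found, but not XZ data";
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical default path; the actual path used by ClearNLP is not shown in this issue.
        String path = args.length > 0 ? args[0] : "general-en-pos.xz";
        System.out.println(describe(path));
    }
}
```

Run it with the rebuilt jar on the classpath. A "not found" result would point at the jar layout rather than at the model itself.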

Minor bugs in tokenisation / tagging

Hi, I recently came across ClearNLP and decided to give it a try.
I've used other NLP frameworks in the past, and I am impressed by the overall quality of yours.
I've found a few small bugs, listed below, in the hope that they help make ClearNLP even better.

I used the dataset of sentences here and analysed them the way you suggest when you show how to use the API. I then grouped the results by POS tag, counting the occurrences of each word (case-insensitive). These are the results I got:

POS-Tags Tokens
'' " (18)
, , (318), (5), / (3), worldwide. (1), (1)
-LRB- [ (62), ( (31)
-RRB- ] (62), ) (31)
. . (151), ... (1), ? (1)
: : (9), ; (3), (2)
ADD ɡreɪt (1)
CC and (166), or (47), but (7), either (5), Yet (1), both (1), ənd (1)
CD 1 (15), one (15), 3 (9), 2 (8), 4 (5), 5 (4), 6 (4), 8 (4), million (3), two (3), 10 (2), 11 (2), 12 (2), 14 (2), 1886 (2), 7 (2), 9 (2), four (2), three (2), 1/100 (1), 13 (1), 1397. (1), 1472. (1), 15 (1), 16 (1), 17 (1), 18 (1), 190 (1), 1901 (1), 1926 (1), 1952 (1), 1986. (1), 20 (1), 20-30 (1), 2000 (1), 2010 (1), 243,610 (1), 500 (1), 64.1 (1), 94,060 (1), billion (1), eight (1), fourteen (1)
DT the (269), A (122), an (17), this (16), These (10), another (7), some (6), Both (4), that (4), each (3), any (2), those (2), all (1), no (1)
EX there (2)
FW i.e. (2), commodity. (1), etc (1), societies. (1)
HYPH - (27), (2), (1)
IN of (189), in (116), to (52), as (35), for (34), by (31), with (22), on (20), from (17), between (12), at (10), while (8), during (5), that (5), Due (3), if (3), into (3), than (3), through (3), under (3), within (3), about (2), against (2), along (2), over (2), since (2), throughout (2), Although (1), Unlike (1), after (1), among (1), because (1), off (1), once (1), pairwise (1), so (1), towards (1), underlie (1), until (1), upon (1), whether (1), without (1), ˈbrɪtən (1)
JJ social (49), other (17), financial (15), such (13), many (10), public (8), first (6), German (5), important (5), ancient (4), complex (4), global (4), modern (4), same (4), specific (4), academic (3), amino (3), certain (3), computational (3), current (3), different (3), early (3), equal (3), genetic (3), governmental (3), individual (3), major (3), scientific (3), various (3), whole (3), 14th (2), 20th (2), Further (2), Human (2), New (2), accessible (2), behavioral (2), biological (2), constitutional (2), dependent (2), discrete (2), general (2), influential (2), international (2), judicial (2), large (2), later (2), latter (2), long (2), metabolic (2), national (2), nonsexual (2), northern (2), particular (2), practical (2), principal (2), real (2), regulatory (2), rich (2), several (2), sociological (2), sovereign (2), statistical (2), subject (2), theoretical (2), urban (2), 15th (1), 16th (1), 17th (1), 18th (1), 22nd-most (1), Abnormal (1), American (1), British (1), European (1), Italian (1), Judicious (1), Short (1), Strong (1), abstract (1), active (1), additional (1), adjacent (1), administrative (1), alternative (1), asymmetric (1), available (1), average (1), basic (1), callable (1), central (1), centralized (1), chemical (1), common (1), consistent (1), contemporary (1), critical (1), cultural (1), curious (1), detailed (1), devoid (1), dimensional (1), direct (1), dyadic (1), eastern (1), electrical (1), empirical (1), epistemological (1), executable (1), executive (1), famous (1), firm (1), flexible (1), formal (1), former (1), fractional (1), fundamental (1), generic (1), geographic (1), great (1), half (1), hermeneutic (1), institutional (1), interconnected (1), interdisciplinary (1), interested (1), internal (1), interpersonal (1), interpretative (1), intractable (1), key (1), legal (1), legislative (1), like (1), linear (1), linguistic (1), liquid (1), local (1), locomotive (1), logistical (1), macro (1), mammalian (1), mathematical (1), 
meaningless (1), medical (1), medieval (1), methodical (1), metropolitan (1), micro (1), mid-twentieth (1), military (1), minimum (1), monetary (1), much (1), multinational (1), multiple (1), nascent (1), natural (1), next (1), non-governmental (1), non-peptide (1), non-profit (1), notional (1), nucleotide (1), only (1), open (1), opposite (1), parliamentary (1), penal (1), peptide (1), pervasive (1), philosophic (1), physical (1), populous (1), posttranslational (1), powerful (1), prime (1), principled (1), professional (1), prosthetic (1), qualitative (1), quantitative (1), recent (1), regional (1), responsible (1), retail (1), rigorous (1), second (1), self-powered (1), small (1), socialist (1), societal (1), spatial (1), square (1), stable (1), standard (1), structural (1), structured (1), suburban (1), symmetric (1), systematic (1), threaded (1), traditional (1), true (1), typical (1), unneeded (1), unstable (1), usable (1), useful (1), vast (1), vehicular (1), visual (1), western (1), wheeled (1), wide (1)
JJR more (2), larger (1), less (1), smaller (1)
JJS largest (6), most (5), oldest (3), best (2), least (1)
MD can (16), may (13), should (2), might (1), will (1)
NFP (s) (1), :- (1), ;-) (1)
NN network (19), credit (16), theory (14), bank (13), Computer (11), graph (11), money (11), banking (9), car (9), loan (9), policy (9), programming (9), sociology (9), analysis (8), behavior (8), century (8), protein (8), subroutine (8), account (7), study (7), system (7), activity (6), amount (6), part (6), program (6), sequence (6), structure (6), use (6), world (6), acid (5), borrower (5), capital (5), lender (5), research (5), science (5), set (5), society (5), state (5), term (5), transportation (5), unit (5), amino (4), area (4), code (4), computation (4), cost (4), country (4), debt (4), default (4), entity (4), interest (4), land (4), number (4), party (4), place (4), time (4), vehicle (4), Internet (3), air (3), centrality (3), change (3), communication (3), concept (3), context (3), development (3), form (3), function (3), health (3), interaction (3), level (3), luxury (3), market (3), nature (3), north (3), object (3), period (3), protection (3), seller (3), subprogram (3), task (3), value (3), variety (3), year (3), Web (2), action (2), agency (2), animal (2), approach (2), automobile (2), balance (2), basis (2), border (2), city (2), class (2), customer (2), date (2), dei (2), deposit (2), discipline (2), driving (2), example (2), field (2), finance (2), folding (2), freedom (2), fuel (2), gasoline (2), gene (2), history (2), information (2), insurance (2), island (2), knowledge (2), language (2), law (2), lending (2), life (2), lifespan (2), maintenance (2), manufacturer (2), method (2), mobility (2), name (2), parking (2), physics (2), pollution (2), polypeptide (2), preference (2), range (2), rate (2), receivable (2), representation (2), response (2), return (2), risk (2), role (2), sense (2), spread (2), swap (2), trade (2), trading (2), transport (2), usage (2), way (2), 1–2 (1), 2007–200 (1), Archaeology (1), Bisociality (1), DNA (1), Government (1), List (1), Monosociality (1), P (1), Road (1), Subject (1), ability (1), access (1), acquisition 
(1), addition (1), agent (1), aggression (1), altruism (1), application (1), archaea (1), array (1), article (1), asset (1), association (1), auto (1), baggage (1), barter (1), behaviour (1), biology (1), birth (1), body (1), bond (1), branch (1), brand (1), business (1), buyer (1), call (1), care (1), case (1), cause (1), cell (1), cell. (1), centre (1), chain (1), coast (1), combustion (1), comfort (1), complexity (1), computing (1), conditioning (1), consumer (1), continuation (1), contract (1), contrast (1), convenience. (1), creation (1), creator (1), credere (1), creditor (1), creditworthiness (1), crisis (1), crossover (1), deal (1), debate (1), debtor (1), decision (1), defence (1), deflagration (1), demand (1), depreciation (1), description (1), design (1), destruction (1), deviance (1), diesel (1), disease (1), disorder (1), distinction (1), division (1), east (1), economy (1), edge (1), education (1), end (1), engine (1), engineering (1), entertainment (1), equity (1), ethanol (1), everyone (1), evidence (1), exchange (1), execution (1), existence (1), expression (1), feasibility (1), fee (1), flow (1), focus (1), foundation (1), funding (1), gas (1), generation (1), glossary (1), goal (1), governance. (1), grain (1), group (1), guide (1), headquarters (1), hierarchy (1), holder (1), importance (1), incentive (1), independence (1), individual (1), influence (1), infrastructure (1), injury (1), institution (1), instruction (1), intercity (1), intermediary (1), interplay (1), interurban (1), invention (1), inventor (1), investigation (1), job (1), justice (1), legislation (1), leisure (1), liability (1), liquidity (1), location (1), locomotive (1), machinery (1), macro. 
(1), mainland (1), making (1), manner (1), material (1), matter (1), meaning (1), mechanism (1), mechanization (1), memory (1), merchant (1), modelling (1), modification (1), monarch (1), monarchy (1), motor (1), motorcar (1), movement (1), navigation (1), nb (1), note (1), nothing (1), oder (1), order (1), organisation (1), organization (1), par (1), parenthesis (1), particle (1), passenger (1), payment (1), payment. (1), percent (1), person (1), perspective (1), petrol (1), physiology (1), point (1), popularity (1), population (1), portion (1), power (1), practice (1), predation (1), premium (1), price (1), principal (1), problem (1), procedure (1), process (1), processing (1), production (1), pronunciation (1), proof (1), prototype (1), provider (1), provision (1), psychology (1), purest (1), pyrrolysine (1), quality (1), rail (1), railroad (1), realisation (1), reallocation (1), receiver (1), regard (1), regulation (1), relation (1), reliability. (1), religion (1), repayment (1), representation. (1), reputation (1), reserve (1), responsibility (1), result (1), revenue (1), rise (1), routine (1), safety (1), scale (1), scapegoating (1), science. (1), scientist (1), sea (1), seating (1), secularisation (1), selling (1), sentence (1), service (1), sex (1), sexuality (1), sharing (1), sheet (1), size (1), slogan (1), software (1), solution (1), source (1), south (1), space (1), sq (1), stability (1), step (1), storage (1), stratification (1), structure. (1), support (1), syntax (1), synthesis (1), systems. (1), tax (1), today (1), tool (1), topic (1), traffic (1), transfer (1), translation (1), travel (1), trust (1), turn (1), turnover (1), type (1), umbrella (1), understanding (1), vertex (1), wealth (1), welfare (1), wellbeing (1), word (1), world. (1)
NNP Benz (9), UK (6), United (6), Europe (5), Mercedes (5), Britain (4), Ireland (4), Italy (4), Kingdom (4), Daimler (3), Great (3), Northern (3), Siena (3), China (2), Empire (2), English (2), Financial (2), Florence (2), India (2), Ireland. (2), Karl (2), London (2), Medici (2), Monte (2), Motorwagen (2), Ocean (2), Paschi (2), Patent (2), Renaissance (2), Republic (2), Roman (2), Sea (2), States (2), Union (2), di (2), (2), AG (1), Allen (1), America (1), Amsterdam (1), Analysis (1), Association (1), Assyria (1), Atlantic (1), Audi (1), BC (1), BMW (1), Babylonia (1), Baden (1), Bank (1), Bardi (1), Basel (1), Belfast (1), Berenberg (1), Berenbergs (1), Beste (1), Big (1), Cardiff (1), Channel (1), ClearNLP (1), Company (1), Comparing (1), Crown (1), Das (1), David (1), Douglas (1), Dutch (1), Edinburgh (1), Elizabeth (1), England (1), Euler (1), Europe. (1), European (1), Eurostat. (1), Falkland (1), February (1), Ford (1), Franklin (1), Gale (1), Genoa (1), Georg (1), Germany (1), Gesellschaft (1), Gibraltar (1), Gill (1), Giovanni (1), Greece (1), Guernsey (1), Gurusamy (1), Holy (1), II (1), Indian (1), Irish (1), Isle (1), Jacob (1), Jersey (1), Königsberg (1), Latin (1), Listeni (1), Man (1), Management (1), Maurice (1), Medicis (1), Model (1), Moreno (1), Motor (1), Motoren (1), North (1), Overseas (1), Peruzzi (1), Policy (1), Public (1), Queen (1), Scotland (1), Seven (1), Simmel (1), Soviet (1), Stanley (1), Stuttgart (1), T (1), Territory (1), U.S. (1), Venice (1), Wales (1), Western (1), Wheeler (1), Wide (1), Wilkes (1), World (1), Württemberg (1), mi (1), selenocysteine (1), ˈaɪərlənd (1), ˈnɔrðərn (1), ˈproʊti.ɨnz (1), ˈproʊˌtiːnz (1), (1)
NNPS Systems (2), Accords (1), Bridges (1), Fuggers (1), Islands (1), NICs. (1), Rothschilds (1), Services (1), Territories (1), Welsers (1)
NNS networks (10), Proteins (9), banks (8), cars (8), systems (8), relations (7), residues (7), benefits (5), countries (5), institutions (5), methods (5), objects (5), subroutines (5), vehicles (5), Applications (4), loans (4), people (4), sciences (4), Behaviors (3), Examples (3), approaches (3), cities (3), definitions (3), economies (3), fields (3), functions (3), goods (3), graphs (3), markets (3), mathematics (3), members (3), programs (3), regulations (3), resources (3), roads (3), species (3), structures (3), theories (3), vertices (3), acids (2), actors (2), automakers (2), bonds (2), branches (2), challenges (2), concepts (2), controls (2), costs (2), decisions (2), disciplines (2), dynamics (2), edges (2), entities (2), fuels (2), funds (2), genes (2), groups (2), humans (2), innovations (2), issues (2), languages (2), laws (2), libraries (2), lines (2), molecules (2), nodes (2), operations (2), opportunities (2), organisations (2), organisms (2), origins (2), others (2), parts (2), patterns (2), policies (2), problems (2), properties (2), researchers (2), restrictions (2), scholars (2), services (2), students (2), techniques (2), terms (2), things (2), transactions (2), 1930s (1), 1950s (1), 1980s. (1), Movements (1), Subprograms (1), accidents (1), actions (1), activities (1), acts (1), administrations (1), administrators (1), affiliations (1), algorithms (1), alternatives (1), analysis. (1), aqueducts (1), aspects (1), assets (1), automobiles (1), bits (1), books (1), borrowers. (1), buses (1), calls (1), capitals (1), carriages (1), carts (1), cells (1), centuries (1), chains (1), changes (1), citizenship. 
(1), classes (1), coaches (1), cofactors (1), complexes (1), computations (1), computers (1), constitutions (1), contracts (1), courses (1), covenants (1), customs (1), days (1), deaths (1), decades (1), deposits (1), developers (1), developments (1), dynasties (1), economics (1), educators (1), emoticons (1), failures (1), families (1), farmers (1), fees (1), fields. (1), focuses (1), goods. (1), graphics (1), ideas (1), implications (1), indicators (1), individuals (1), inhabitants. (1), institutions. (1), instructions (1), instruments (1), interpretations (1), investors (1), islands (1), kilometres (1), lawmakers (1), lenders (1), liabilities (1), macromolecules (1), magnates (1), makers (1), managers (1), masses (1), materials (1), measures (1), merchants (1), minutes (1), mɛʁˈt͡seːdəs (1), nations (1), networks. (1), nichts (1), numbers (1), obligations (1), organizations (1), origin. (1), paradigms (1), parties (1), passengers (1), peptides (1), pipes (1), places (1), planners (1), points (1), politicians (1), polypeptides (1), powers (1), practices (1), practitioners (1), priorities (1), procedures (1), processes (1), professors (1), railways (1), reactions (1), relationships (1), repairs (1), representatives (1), requirements (1), roots (1), routes (1), savers (1), savings (1), schools (1), services. (1), sexes (1), shyness. (1), smileys (1), societies (1), sociograms (1), sociologists (1), spaces (1), spheres (1), spreaders (1), standards (1), statistics (1), stimuli (1), streets (1), structures. (1), subfields (1), subjects (1), substrates (1), systems. (1), tabulations (1), tasks (1), taxes (1), telecommunications (1), temples (1), ties (1), times (1), topics (1), traders (1), transfers (1), triads (1), trucks (1), turns (1), types (1), universities (1), variations (1), warming. (1), ways (1), wheels (1), workers (1), years (1)
POS 's (10), ' (1), ˈbɛnt͡s (1)
PRP It (10), they (8), itself (4), them (4), I (3), One (1)
PRP$ its (15), their (6)
RB also (18), not (9), often (7), commonly (5), generally (4), directly (3), only (3), primarily (3), rapidly (3), sometimes (3), then (3), together (3), widely (3), However (2), Still (2), about (2), as (2), back (2), e.g. (2), first (2), highly (2), mathematically (2), much (2), putatively (2), rather (2), thereby (2), usually (2), well (2), 11th (1), 78th (1), Apart (1), Conversely (1), Further (1), Once (1), Perhaps (1), Shortly (1), after (1), analytically (1), around (1), basically (1), broadly (1), chemically (1), closely (1), computationally (1), continuously (1), correctly (1), dramatically (1), effectively (1), efficiently (1), especially (1), essentially (1), even (1), far (1), flexibly (1), formerly (1), gradually (1), immediately (1), increasingly (1), indirectly (1), inherently (1), initially (1), instead (1), just (1), loosely (1), necessarily (1), normally (1), notably (1), now (1), principally (1), rarely (1), respectively (1), second (1), separately (1), substantially (1), super (1), typically (1), ultimately (1), universally (1), up (1)
RBR more (5), Later (1), less (1), longer (1)
RBS most (3)
RP up (2)
SYM / (4)
TO to (32)
VB be (23), include (5), Refer (3), see (3), have (2), pay (2), repay (2), achieve (1), act (1), become (1), believe (1), cause (1), charge (1), climate (1), conduct (1), consist (1), denote (1), develop (1), distinguish (1), encourage (1), engage (1), ensure (1), examine (1), exist (1), form (1), identify (1), induce (1), let (1), locate (1), measure (1), model (1), move (1), operate (1), place (1), provide (1), reduce (1), reflect (1), reimburse (1), require (1), return (1), run (1), serve (1), solve (1), specify (1), study (1), support (1), work (1)
VBD was (6), were (4), had (3), called (2), caused (2), made (2), took (2), accepted (1), added (1), appeared (1), applied (1), authored (1), became (1), began (1), built (1), carried (1), changed (1), deposited (1), did (1), dominated (1), emerged (1), led (1), provoked (1), referred (1), replaced (1), termed (1)
VBG including (5), being (3), making (3), According (2), Acting (2), developing (2), existing (2), increasing (2), lending (2), maintaining (2), using (2), writing (2), Lying (1), acquiring (1), analyzing (1), ascending (1), catalyzing (1), compiling (1), comprising (1), concerning (1), consisting (1), containing (1), contributing (1), corresponding (1), creating (1), describing (1), disposing (1), emphasizing (1), encompassing (1), establishing (1), explaining (1), funding (1), gaining (1), generating (1), granting (1), identifying (1), implementing (1), issuing (1), living (1), loaning (1), meaning (1), operating (1), reaching (1), refining (1), replicating (1), resolving (1), responding (1), resulting (1), taking (1), transporting (1), underlying (1), varying (1)
VBN used (11), known (8), based (6), called (6), been (3), considered (3), developed (3), directed (3), made (3), provided (3), applied (2), attached (2), credited (2), defined (2), degraded (2), designed (2), encoded (2), estimated (2), expanded (2), followed (2), performed (2), recorded (2), regarded (2), traded (2), accepted (1), added (1), adopted (1), affected (1), associated (1), authorized (1), balanced (1), blamed (1), bonded (1), closed (1), coded (1), collected (1), composed (1), constructed (1), contrasted (1), deferred (1), denominated (1), denoted (1), deposited (1), derived (1), described (1), devolved (1), dictated (1), disputed (1), divided (1), done (1), drawn (1), electrified (1), embodied (1), employed (1), enforced (1), equipped (1), established (1), evidenced (1), evolved (1), extended (1), formalized (1), formed (1), fueled (1), funded (1), given (1), headquartered (1), institutionalised (1), intended (1), lent (1), manufactured (1), measured (1), misfolded (1), modified (1), obligated (1), observed (1), organized (1), oriented (1), owed (1), packaged (1), perceived (1), played (1), powered (1), promulgated (1), propelled (1), recycled (1), referenced (1), regulated (1), related (1), risen (1), seen (1), started (1), studied (1), surrounded (1), taken (1), targeted (1), traced (1), transcribed (1), undirected (1), used. (1), weighed (1), withdrawn (1)
VBP are (39), include (7), have (6), focus (2), perform (2), 'm (1), add (1), associate (1), connect (1), define (1), depend (1), differ (1), draw (1), emphasize (1), exchange (1), exist (1), hold (1), identify (1), increase (1), provoke (1), study (1), value (1), viewpoint (1)
VBZ is (80), has (10), allows (3), provides (3), considers (2), describes (2), does (2), includes (2), receives (2), refers (2), represents (2), specifies (2), takes (2), uses (2), alters (1), arranges (1), begins (1), behaves (1), borrows (1), concerns (1), consists (1), contains (1), covers (1), creates (1), delivers (1), dependencies (1), detects (1), determines (1), encompasses (1), entails (1), explores (1), focuses (1), forms (1), generates (1), happens (1), informs (1), investigates (1), involves (1), lies (1), means (1), oligopeptides (1), pays (1), permits (1), ranges (1), results (1), shares (1), shows (1), specializes (1), suggests (1), traces (1), varies (1)
WDT which (31), that (17)
WP What (2), who (1)
WRB where (2), wherever (1)
XX (s) (1)
`` " (17)
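
For reference, a tally like the one above can be built with a nested sorted map. This is a minimal sketch that assumes the tagger output is already available as {word form, POS tag} pairs; how those pairs are obtained from ClearNLP is not shown, and the class name PosTally is hypothetical.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

final class PosTally {
    /** Counts lower-cased word forms per POS tag; each input row is {form, tag}. */
    static Map<String, Map<String, Integer>> tally(String[][] taggedTokens) {
        Map<String, Map<String, Integer>> byTag = new TreeMap<>();
        for (String[] pair : taggedTokens) {
            String form = pair[0].toLowerCase(Locale.ROOT); // case-insensitive counting
            byTag.computeIfAbsent(pair[1], k -> new TreeMap<>())
                 .merge(form, 1, Integer::sum);
        }
        return byTag;
    }

    public static void main(String[] args) {
        String[][] sample = {{"The", "DT"}, {"bank", "NN"}, {"the", "DT"}, {"loan", "NN"}};
        System.out.println(tally(sample)); // prints {DT={the=2}, NN={bank=1, loan=1}}
    }
}
```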

And here are the bugs:

  • when using .getWordForm(), .getSimplifiedWordForm() or .getLowerSimplifiedWordForm(), tokens at the end of a sentence (that is, tokens followed by a period token .) come back with the . appended. Examples: 1986., worldwide., etc. Is this a bug, or is it intentional?
  • the type of bracket is lost: [ and ( both become -LRB-, while ] and ) both become -RRB-

And the following are probably only due to the specific set of documents used for training (so not really bugs):

  • i.e., etc, governance, societies are recognised as foreign words (FW)
  • worldwide is recognised as a comma (,)
  • :-P is not recognised as a smiley
  • phonetic transcriptions are not properly recognised (they are fairly common in Wikipedia pages, though; would it be worth handling them in the future?)

As a side note, if you can't train your own dictionary, I suggest always loading all the files for the default dictionary, because the quality of the NLP output improves significantly. This holds even though it uses a lot of memory (12 GB on my machine) and takes quite some time.

I hope this helps. Thanks for maintaining this framework, and keep up the great work!

Tokenization

Good practice in tokenization is to make the tokenization pipeline information preserving, in the sense that you can always recover the original form of the input document, including details of whitespace and other formatting and encoding details. If you do that, you can anchor every annotation (morphs, words, pos-tags, dependencies, named entities and SRL) back to locations in the original input. You have to be a little indirect sometimes, because it is no longer OK to directly change the input data. You have to record the change, and keep the original, while presenting the right view to later stages in the pipeline.

The ClearNLP tokenizer (I believe) loses information about whitespace in tokenize.AbstractTokenizer.tokenizeWhitespace, and is fairly free in tinkering with the input. This would be good to fix. If there is interest, I'll try.
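
To make the idea concrete, here is a minimal sketch of offset-based tokenization. It is not ClearNLP code: the OffsetTokenizer and Span names are hypothetical, and the splitting rule is plain whitespace. The point is only that tokens are stored as character spans into the untouched input, so whitespace and formatting can always be recovered.

```java
import java.util.ArrayList;
import java.util.List;

final class OffsetTokenizer {
    /** A token as a half-open character span [begin, end) in the original text. */
    static final class Span {
        final int begin, end;
        Span(int begin, int end) { this.begin = begin; this.end = end; }
        String text(String source) { return source.substring(begin, end); }
    }

    /** Splits on whitespace but records offsets instead of copying or altering text. */
    static List<Span> tokenize(String source) {
        List<Span> spans = new ArrayList<>();
        int i = 0, n = source.length();
        while (i < n) {
            while (i < n && Character.isWhitespace(source.charAt(i))) i++;
            int begin = i;
            while (i < n && !Character.isWhitespace(source.charAt(i))) i++;
            if (i > begin) spans.add(new Span(begin, i));
        }
        return spans;
    }

    /** Rebuilds the input exactly, taking the text between tokens from the source. */
    static String reconstruct(String source, List<Span> spans) {
        StringBuilder sb = new StringBuilder();
        int prev = 0;
        for (Span s : spans) {
            sb.append(source, prev, s.begin).append(s.text(source));
            prev = s.end;
        }
        return sb.append(source, prev, source.length()).toString();
    }

    public static void main(String[] args) {
        String doc = "Hello,\t world!\n  Bye. ";
        List<Span> spans = tokenize(doc);
        System.out.println(spans.size() + " tokens, lossless=" + reconstruct(doc, spans).equals(doc));
    }
}
```

Later pipeline stages could attach a normalized form to each span as a separate field, which keeps the original text intact while still presenting a cleaned-up view.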

DS results in Java exception

Java throws an IllegalArgumentException when sorting a List<ObjectDoublePair<double[]>>:

java.lang.IllegalArgumentException: Comparison method violates its general contract!
at java.util.TimSort.mergeHi(TimSort.java:895)
at java.util.TimSort.mergeAt(TimSort.java:512)
at java.util.TimSort.mergeCollapse(TimSort.java:437)
at java.util.TimSort.sort(TimSort.java:241)
at java.util.Arrays.sort(Arrays.java:1512)
at java.util.ArrayList.sort(ArrayList.java:1454)
at java.util.Collections.sort(Collections.java:175)
at edu.emory.clir.clearnlp.relation.parameter.MainEntityExtractorParamenterSearch.search(MainEntityExtractorParamenterSearch.java:100)
at edu.emory.clir.clearnlp.relation.parameter.MainEntityExtractorParamenterSearch.run(MainEntityExtractorParamenterSearch.java:82)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
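
The comparator used at MainEntityExtractorParamenterSearch.java:100 is not shown in this report, so the following is only a sketch of one common cause and a possible fix. TimSort raises this exception when a comparator is inconsistent, for instance when it never returns 0 for equal values or mishandles NaN. The Scored class is a hypothetical stand-in for ObjectDoublePair<double[]>, not the ClearNLP type.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class ScoreSortDemo {
    /** Hypothetical stand-in for ObjectDoublePair<double[]>; not the ClearNLP class. */
    static final class Scored {
        final double[] params;
        final double score;
        Scored(double[] params, double score) { this.params = params; this.score = score; }
    }

    // Unsafe pattern: never returns 0, so compare(a, b) and compare(b, a) are both -1
    // for equal scores (and for NaN), which breaks the Comparator contract.
    static final Comparator<Scored> BROKEN = (a, b) -> a.score > b.score ? 1 : -1;

    // Contract-respecting alternative: uses Double.compare, a total order that includes NaN.
    static final Comparator<Scored> BY_SCORE = Comparator.comparingDouble(s -> s.score);

    public static void main(String[] args) {
        List<Scored> list = new ArrayList<>();
        for (int i = 0; i < 100; i++) list.add(new Scored(new double[] {i}, i % 7));
        list.sort(BY_SCORE.reversed()); // highest score first
        System.out.println(list.get(0).score);
    }
}
```

Whether the exception actually appears with a broken comparator depends on the data and the list size, which would explain why it shows up only for some runs.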

Comparison of ClearNLP 2.0 and ClearNLP 3.1

  1. Missing "nsubj" dependency with correct POS tagging. I've noticed that the "nsubj" dependency disappeared in some cases where the subject is relatively far from the predicate. It worked fine in ClearNLP 2.0. For example, there is no nsubj(stare, People) dependency in these sentences:

People_NNS I_PRP 've_VBP known_VBN for_IN years_NNS ,_, who_WP I_PRP used_VBD to_TO greet_VB in_IN passing_VBG on_RP the_DT street_NN ,_, now_RB stare_VB at_IN me_PRP with_IN a_DT mixture_NN of_IN fear_NN and_CC hatred_NN ._.

People_NNS ,_, who_WP I_PRP used_VBD to_TO greet_VB in_IN passing_VBG on_RP the_DT street_NN ,_, now_RB stare_VB at_IN me_PRP with_IN a_DT mixture_NN of_IN fear_NN and_CC hatred_NN ._.

But it comes back if I remove the clause:

People_NNS I_PRP 've_VBP known_VBN for_IN years_NNS now_RB stare_VBP at_IN me_PRP with_IN a_DT mixture_NN of_IN fear_NN and_CC hatred_NN ._.

  2. Missing "nsubj" dependency with incorrect POS tagging. The ClearNLP 2.0 dependency parser was able to produce correct dependency trees even when the POS tagging was incorrect. For example, in the cases below the parser used to recognize the predicate. With the new version, the predicates are missing:

Trends_NNS that_WDT are_VBP prevalent_JJ today_NN ,_, regardless_RB of_IN industry_NN ,_, matter_NN to_IN everyone_NN ._.
Used to have: nsubj(matter, Trends)

Place_NN a_DT large_JJ sheet_NN of_IN parchment_NN paper_NN on_IN a_DT work_NN surface_NN ,_, and_CC dust_NN the_DT parchment_NN lightly_RB with_IN flour_NN ._.
Used to have: dobj(place, sheet)

As_IN Babe_NNP enters_VBZ the_DT Olympic_JJ stadium_NN ,_, every_DT person_NN cheers_NNS and_CC yells_NNS her_PRP$ name_NN ._.
Used to have: nsubj(yells, person)
Used to have: dobj(yells, name)

That_DT is_VBZ ,_, communism_NN functions_NNS as_IN the_DT negation_NN of_IN alienation_NN ,_, which_WDT in_IN turn_NN ,_, alienation_NN is_VBZ the_DT negation_NN of_IN man_NN ._.
Used to have: nsubj(functions, communism)

The_DT object_NN wo_MD n't_RB move_VB till_IN an_DT external_JJ force_NN acts_NNS on_IN it_PRP ._.
Used to have: nsubj(acts, force)

NB! Yet in some cases the dependencies remained unchanged:

Wendy_NNP thinks_NNS they_PRP 're_VBP frightened_VBN ._.
nsubj(thinks, Wendy)

Every_DT opportunity_NN and_CC option_NN changes_NNS us_PRP ._.
nsubj(changes, opportunity)
nsubj(changes, option)
dobj(changes, us)

Must_MD be_VB very_RB busy_JJ because_IN he_PRP does_VBZ not_RB answers_NNS my_PRP$ e-mail_NN or_CC phone_NN call_NN ,_, but_CC he_PRP talk_VBP two_CD weeks_NNS ago_RB ,_, he_PRP needed_VBD some_DT more_JJR time_NN ._.
nsubj(answers, he)
dobj(answers, e-mail)
dobj(answers, call)

If_IN you_PRP suddenly_RB notice_VBP that_IN someone_NN is_VBZ suspiciously_RB interested_JJ for_IN the_DT diamond_NN jewelery_NN display_NN ,_, alert_NN the_DT security_NN ._.
dobj(alert, security)

One_CD example_NN of_IN this_DT natural_JJ laws_NNS is_VBZ that_IN everything_NN changes_NNS except_IN of_IN change_NN ,_, itself_PRP ._.
nsubj(changes, everything)

Here is an interesting example. In the first sentence, the predicate and its dependencies are present, yet they are gone in the second case. ClearNLP 2.0 returned the correct dependencies in both cases:

Mary_NNP thanks_NNS you_PRP for_IN the_DT ocasional_JJ brotherly_JJ concerns_NNS in_IN the_DT form_NN of_IN wake-up-calls_NNS ._.
Present:
nsubj(thanks, Mary)
dobj(thanks, you)

She_PRP thanks_NNS you_PRP for_IN the_DT ocasional_JJ brotherly_JJ concerns_NNS in_IN the_DT form_NN of_IN wake-up-calls_NNS ._.
Gone:
nsubj(thanks, she)
dobj(thanks, you)

A different case of ambiguity that the previous version of ClearNLP was able to deal with:

By_IN 1778_CD ,_, the_DT Prince_NNP is_VBZ paying_VBG for_IN full_JJ theater_NN seasons_NNS ;_: opera_NN directing_VBG and_CC composing_VBG become_VBN Hayd_NNP n_NN 's_POS full-time_JJ job_NN ._.
Used to have: nsubj(become, composing)

  1. Missing "nsubj" with correct POS tagging and an error in the sentence. What is interesting here is that I now get a "poss" dependency headed by a modal verb (its modal status is confirmed by the "aux" dependency), which is grammatically impossible.

I_PRP know_VBP your_PRP$ might_MD be_VB hesitant_JJ to_TO teach_VB ,_, but_CC if_IN you_PRP teach_VBP God_NNP 's_POS word_NN it_PRP will_MD be_VB beneficial_JJ to_IN these_DT kids_NNS ._.
Used to have: nsubj(be, your)
Now: poss(might, your) but still aux(be, might)

I_PRP think_VBP your_PRP$ must_MD be_VB very_RB good_JJ in_IN English_NNP language_NN ._.
Used to have: nsubj(be, your)
Now: poss(must, your) but still aux(be, must)

I_PRP wish_VBP your_PRP$ will_MD find_VB a_DT better_JJR one_NN ._.
Used to have: nsubj(find, your)
Now: poss(will, your) but still aux(find, will)

Some cases remained unchanged, though:

I_PRP hope_VBP your_PRP$ are_VBP okay_JJ !_.
nsubj(are, your)

I_PRP think_VBP your_PRP$ were_VBD right_JJ about_IN the_DT public_JJ media_NNS ._.
nsubj(were, your)

  1. The "appos" dependency went missing. It was present with ClearNLP 2.0:

Anna_NNP ,_, pregnant_JJ 29-year_JJ old_JJ Connecticut_NNP social_JJ worker_NN ,_, is_VBZ at_IN home_NN ._.
Used to have: appos(Anna, worker)
Now:
no relation between "Anna" and "worker".
nsubj(is, Anna)
nsubj(is, worker)

Peter_NNP ,_, pigeon-toed_JJ penguin_NN ,_, was_VBD a_DT nice_JJ guy_NN ,_, who_WP Bucky_NNP knew_VBD he_PRP could_MD always_RB count_VB on_IN ._.
Used to have: appos(Peter, penguin)
Now:
no relation between "Peter" and "penguin".
nsubj(was, Peter)
nsubj(was, penguin)

There_EX are_VBP many_JJ free_JJ out_IN there_RB ,_, pick_VB one_NN you_PRP like_VBP or_CC if_IN you_PRP have_VBP no_DT idea_NN my_PRP$ tip_NN would_MD be_VB to_TO download_VB Picasa_NNP ,_, free_JJ photo_NN digital_NN software_NN ._.
Used to have: appos(Picasa, software)
Now:
no relation between "Picasa" and "software".

An interesting case: the "appos" dependency comes back if I remove the quotation marks:

``_" The_DT hangman_NN ,_, grey-haired_JJ convict_NN in_IN the_DT white_JJ uniform_NN of_IN the_DT prison_NN ,_, was_VBD waiting_VBG beside_IN his_PRP$ machine_NN ._. ''_"
Used to have: appos(hangman, convict)
Now:
nsubj(waiting, convict)
nsubj(waiting, hangman)

The_DT hangman_NN ,_, grey-haired_JJ convict_NN in_IN the_DT white_JJ uniform_NN of_IN the_DT prison_NN ,_, was_VBD waiting_VBG beside_IN his_PRP$ machine_NN ._.
appos(hangman, convict)
nsubj(waiting, hangman)

  1. Just one more error I came across:

Let_VB me_PRP know_VB if_IN those_DT dates_NNS work_VB for_IN you_PRP and_CC we_PRP 'll_MD get_VB you_PRP ticket_NN ._.
poss(ticket, you)
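
For anyone who wants to reproduce comparisons like the "Used to have" / "Now" pairs above, here is a minimal sketch in plain Java. It does not use the ClearNLP API at all: it assumes you have already dumped the dependencies of each version as strings in the `label(head, dependent)` notation used in this report. The class name `DependencyDiff` and its methods are hypothetical helpers, not part of any library.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DependencyDiff {
    // Matches triples written as label(head, dependent), e.g. "nsubj(matter, Trends)".
    private static final Pattern TRIPLE =
            Pattern.compile("(\\w+)\\(\\s*([^,()]+?)\\s*,\\s*([^,()]+?)\\s*\\)");

    /** Normalizes one triple to "label(head, dependent)"; throws if the line is malformed. */
    public static String normalize(String line) {
        Matcher m = TRIPLE.matcher(line.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a dependency triple: " + line);
        }
        return m.group(1) + "(" + m.group(2) + ", " + m.group(3) + ")";
    }

    /** Returns the triples present in the old output but absent from the new one. */
    public static Set<String> missing(Iterable<String> oldDeps, Iterable<String> newDeps) {
        Set<String> result = new LinkedHashSet<>();
        for (String d : oldDeps) {
            result.add(normalize(d));
        }
        for (String d : newDeps) {
            result.remove(normalize(d));
        }
        return result;
    }

    public static void main(String[] args) {
        // Triples taken from the "Anna" example in this report.
        List<String> oldDeps = Arrays.asList("appos(Anna, worker)");
        List<String> newDeps = Arrays.asList("nsubj(is, Anna)", "nsubj(is, worker)");
        System.out.println(missing(oldDeps, newDeps)); // prints [appos(Anna, worker)]
    }
}
```

Comparing normalized strings keeps the sketch independent of either parser version; swapping the arguments of `missing` lists the dependencies that are new in the later version.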

3.2.0: Version info not updated

Really a minor issue but got me a bit confused when setting up the 3.2.0 version without maven:

edu.emory.clir.clearnlp.bin.Version of 3.2.0 still says it's version 3.2.2

Data models loading in version 3.0.0

Hi,
I am trying to use the new version, but it looks like I cannot load the dictionary model, the POS tagger model, and the dependency model. (In version 2 it was sufficient to declare the data models in the pom file; you did not have to specify any path.) Could you provide a simple example with Maven?
Apparently my app is not able to find the path to the data models in the jars. I also tried unpacking the jar and using the gzip file, but that does not work either.
Thanks
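
One way to narrow this down is to check whether the model file is visible on the classpath at all, independently of ClearNLP. The sketch below uses only the standard Java class loader API; the class name `ModelOnClasspath` is a hypothetical helper, and the default resource name `general-en-pos.xz` is an assumption that should be replaced with the path actually listed inside the model jar (for example via `jar tf`).

```java
import java.io.IOException;
import java.io.InputStream;

public class ModelOnClasspath {
    /** Returns true if the named resource can be opened through the class loader. */
    public static boolean isVisible(String resourcePath) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ModelOnClasspath.class.getClassLoader();
        }
        // getResourceAsStream returns null when the resource is not found.
        try (InputStream in = loader.getResourceAsStream(resourcePath)) {
            return in != null;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed resource name; pass the real path inside the model jar as the first argument.
        String path = args.length > 0 ? args[0] : "general-en-pos.xz";
        System.out.println(path + " visible on classpath: " + isVisible(path));
    }
}
```

If this prints `false` when run with the same classpath as the application, the model jar is probably not being picked up by the Maven build, and the problem lies in the dependency setup rather than in the loading code.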

Portuguese Support

Is it possible to use ClearNLP with the Floresta Sintática corpus, either directly or after converting it to Treebank format, given that "Portuguese" isn't listed in TLanguage?
