
newsindexer's People

Contributors

nicklondhe


newsindexer's Issues

Bug in ParserTest.java

In the method validateAuthorOrg

When the value of authorOrg is null, the assertNull always fails, because assertNull checks the String array reference returned by getField rather than the value stored in it.

The code is as follows:
private void validateAuthorOrg(Document d, int count) {
    String authorOrg = authororgs[count];

    if (authorOrg == null) {
        assertNull(d.getField(FieldNames.AUTHORORG));
    } else {
        assertEquals(authorOrg,
            d.getField(FieldNames.AUTHORORG)[0]);
    }
}

The corrected code is as follows:
private void validateAuthorOrg(Document d, int count) {
    String authorOrg = authororgs[count];

    if (authorOrg == null) {
        assertNull(d.getField(FieldNames.AUTHORORG)[0]);
    } else {
        assertEquals(authorOrg,
            d.getField(FieldNames.AUTHORORG)[0]);
    }
}

The same bug applies to the validateAuthor method as well.
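The difference between the two checks can be seen without JUnit. This is only a minimal sketch: the getField stand-in and the class name are hypothetical, and it assumes (as the issue implies) that a field with no value comes back as a one-element array holding null.

```java
// Hypothetical stand-in for Document.getField; not taken from the project.
final class NullFieldCheck {
    // Assumption: a field without a value is returned as { null }.
    static String[] getField(String value) {
        return new String[] { value };
    }

    public static void main(String[] args) {
        String[] field = getField(null);
        // The array reference itself is not null, so assertNull(field) would fail.
        System.out.println(field == null);    // false
        // The first element is null, so assertNull(field[0]) would pass.
        System.out.println(field[0] == null); // true
    }
}
```

If getField can also return null for a missing field, indexing with [0] would throw a NullPointerException, so the test may need to accept both cases.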

Bug in SymbolRuleTest

As per one of the rules: "Any punctuation marks that possibly mark the end of a sentence (. ! ?) should be removed. Obviously if the symbol appears within a token it should be retained (a.out for example)."

Since opening and closing brackets "()" are also considered punctuation marks, the test case f'(x) => f(x) should not yield f(x). Ideally it should yield f(x
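For comparison, here is a small sketch of the quoted rule read literally, where only the three listed marks (. ! ?) are stripped and only at the end of a token. The helper name is hypothetical and this is not the project's SymbolRule; under this reading brackets are left alone, which is exactly the point the issue questions.

```java
final class EndPunctuation {
    // Hypothetical helper: removes '.', '!' and '?' only at the end of a token.
    static String stripSentenceEnd(String token) {
        return token.replaceAll("[.!?]+$", "");
    }

    public static void main(String[] args) {
        System.out.println(stripSentenceEnd("test."));  // test
        System.out.println(stripSentenceEnd("a.out"));  // a.out (symbol inside the token is kept)
        System.out.println(stripSentenceEnd("f(x)"));   // f(x)  (brackets are not in the listed set)
    }
}
```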

Error in getAnalyzedTerm test case

While testing the indexer, the TokenStream is not updated in getAnalyzedTerm.

    Analyzer analyzer = fact.getAnalyzerForField(FieldNames.CONTENT, stream);

    while (analyzer.increment()) { } // Stream is not updated.

    stream.reset();
    return stream.next().toString();

As seen above, the stream is not updated and is used again directly. Is this a bug? Or should the analyzer work on the same object rather than create a new TokenStream internally? I am a bit confused here.
On UBlearns, you said the following should be the ideal case. Please let me know whether the change will be incorporated in the final version you use to test.

The indexer test works properly with this code:

    Analyzer analyzer = fact.getAnalyzerForField(FieldNames.CONTENT, stream);

    stream = analyzer.getStream();

    while (analyzer.increment()) { }

    stream.reset();
    return stream.next().toString();

Bug in IndexerTest

The method setupIndex() in IndexerTest has the @BeforeClass annotation and hence must be a static method.
Also, the variable reader will have to be made static so that it can be accessed inside setupIndex().

Test Bugs

  1. TokenStreamTest getCurrent()

    //null at end
    assertFalse(stream.hasNext());
    assertNull(stream.getCurrent());

    getCurrent() still points to the last element returned by next(), so possibly add a call to next() between the two asserts.

  2. TokenStreamTest getCurrent()

    while (stream.hasNext()) {
        tNext = stream.next();

        for (int i = 0; i < 5; i++) {
            tCurrent = stream.getCurrent();
            assertTrue(stream.hasNext());
            assertEquals(tNext, tCurrent);
        }
    }

    At some point tNext points to the last element, so assertTrue(stream.hasNext()) will fail.

  3. TokenizerTest

    TokenStream ts = spaceTknizer.consume("This is a longer test.");
    assertArrayEquals(new String[]{"This", "is", "a", "longer", "test"}, serializeStream(ts));

    If our delimiter is whitespace, then the last token should be "test." and not "test". Removing the "." is the responsibility of SymbolTokenFilter.

  4. TokenTest

    invokeMerge keeps failing with a "wrong number of arguments" description. It never gets to the actual call of merge within the token object.
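One common cause of a "wrong number of arguments" failure in reflective calls is passing an array straight to Method.invoke, which spreads it into separate arguments. Whether TokenTest does this is an assumption on my part; the sketch below uses a hypothetical varargs method count(String...) as a stand-in for a merge-style method.

```java
import java.lang.reflect.Method;

final class VarargsInvoke {
    // Hypothetical target with a single varargs (array) parameter.
    public static int count(String... parts) {
        return parts.length;
    }

    // Passes the array as the argument list itself: invoke sees two arguments
    // for a one-parameter method and throws IllegalArgumentException.
    static boolean spreadCallFails() throws Exception {
        Method m = VarargsInvoke.class.getMethod("count", String[].class);
        String[] parts = {"a", "b"};
        try {
            m.invoke(null, (Object[]) parts);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    // Wraps the array as one Object so it arrives as the single parameter.
    static int wrappedCall() throws Exception {
        Method m = VarargsInvoke.class.getMethod("count", String[].class);
        String[] parts = {"a", "b"};
        return (Integer) m.invoke(null, (Object) parts);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(spreadCallFails()); // true
        System.out.println(wrappedCall());     // 2
    }
}
```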

IndexerTest.testQuery

I believe your intersect method is modifying the original index hashmaps. On the third iteration the occurrences are 4 for docs 2, 3, 4 when they should be 3.

Given docs:
String[] strs = {"new home sales top sales forecasts", "home sales rise in july",
"increase in home sales in july", "july new home sales rise"};
Query: "sales", "home", "july"
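To show where the expected count of 3 comes from, here is a minimal sketch that builds postings from the four docs above and intersects them without touching the input maps. The class and method names are hypothetical, doc ids are assumed to be 1-based, and occurrences are assumed to be summed across terms; this is not the code in IndexerTest.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

final class PostingsIntersect {
    // Builds docId -> occurrences for one term; docs with 0 occurrences get no posting.
    static Map<Integer, Integer> postings(String term, String[] docs) {
        Map<Integer, Integer> result = new HashMap<>();
        for (int i = 0; i < docs.length; i++) {
            int count = 0;
            for (String word : docs[i].split(" ")) {
                if (word.equals(term)) count++;
            }
            if (count > 0) result.put(i + 1, count);
        }
        return result;
    }

    // Returns a new map with summed occurrences; neither argument is modified.
    static Map<Integer, Integer> intersect(Map<Integer, Integer> a, Map<Integer, Integer> b) {
        Map<Integer, Integer> out = new TreeMap<>();
        for (Map.Entry<Integer, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) out.put(e.getKey(), e.getValue() + other);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] docs = {"new home sales top sales forecasts", "home sales rise in july",
                "increase in home sales in july", "july new home sales rise"};
        Map<Integer, Integer> sales = postings("sales", docs);
        Map<Integer, Integer> result =
                intersect(intersect(sales, postings("home", docs)), postings("july", docs));
        System.out.println(result);       // {2=3, 3=3, 4=3}
        System.out.println(sales.get(1)); // 2, the original postings are untouched
    }
}
```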


Bugs in TokenStreamTest

There are 2 bugs in testGetCurrent(), or my understanding of the getCurrent method is wrong.

  1. stream.getCurrent() is expected to return null when the stream is sitting at the last index. It should just return the last element.
    [https://github.com/nicklondhe/newsindexer/blob/master/edu/buffalo/cse/irf14/analysis/test/TokenStreamTest.java#L267]
  2. Again, tCurrent, i.e. stream.getCurrent(), is expected to be null. "is" has been removed from "this is a test" before this step, so why is getCurrent expected to return null?
    [https://github.com/nicklondhe/newsindexer/blob/master/edu/buffalo/cse/irf14/analysis/test/TokenStreamTest.java#L295]
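The reading of getCurrent() described in point 1 can be sketched as follows. This is only one possible interpretation, assuming getCurrent() returns whatever next() returned last; SimpleStream is a hypothetical class, not the project's TokenStream, and it does not model remove().

```java
import java.util.Arrays;
import java.util.List;

final class SimpleStream {
    private final List<String> tokens;
    private int position = 0;      // index of the token the next call to next() returns
    private String current = null; // token most recently returned by next()

    SimpleStream(String... tokens) {
        this.tokens = Arrays.asList(tokens);
    }

    boolean hasNext() {
        return position < tokens.size();
    }

    // Returns the next token, or null once the stream is exhausted.
    String next() {
        current = hasNext() ? tokens.get(position++) : null;
        return current;
    }

    String getCurrent() {
        return current;
    }

    public static void main(String[] args) {
        SimpleStream s = new SimpleStream("this", "is", "a", "test");
        while (s.hasNext()) s.next();
        System.out.println(s.hasNext());    // false
        System.out.println(s.getCurrent()); // test
        s.next();                           // one more call past the end
        System.out.println(s.getCurrent()); // null
    }
}
```

Under this interpretation, the test's assertNull(stream.getCurrent()) only holds after one extra next() call past the end.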

IndexerTest.testQuery

The dummy index you use has docs in a map with 0 occurrences. As a result, the size is wrong for intersections of postings lists.

Line 182, iteration i=2:

The expected size is 4, when the correct size after intersection is 3. This is because a posting for doc 1 with 0 occurrences exists.


prepareIndex fails in IndexerTest

java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to [Ljava.util.HashMap;
at edu.buffalo.cse.irf14.index.test.IndexerTest.prepareIndex(IndexerTest.java:247)
at edu.buffalo.cse.irf14.index.test.IndexerTest.testQuery(IndexerTest.java:173)

Is anyone else getting this?
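The message says an Object[] was cast to HashMap[]. One common way to get this is casting the result of List.toArray(), whose runtime type is Object[]; whether line 247 of IndexerTest does exactly this is an assumption. A minimal sketch with hypothetical names:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

final class ArrayCastDemo {
    // toArray() with no argument returns an Object[]; the downcast compiles
    // but throws ClassCastException at runtime.
    static boolean plainToArrayCastFails(List<HashMap<String, Integer>> maps) {
        try {
            HashMap<?, ?>[] bad = (HashMap<?, ?>[]) maps.toArray();
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    // Passing a typed array makes toArray return an array of that runtime type.
    static HashMap<?, ?>[] typedToArray(List<HashMap<String, Integer>> maps) {
        return maps.toArray(new HashMap<?, ?>[0]);
    }

    public static void main(String[] args) {
        List<HashMap<String, Integer>> maps = new ArrayList<>();
        maps.add(new HashMap<>());
        System.out.println(plainToArrayCastFails(maps)); // true
        System.out.println(typedToArray(maps).length);   // 1
    }
}
```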

Bugs in SpecialCharRuleTest

For the test case "email is [email protected]", the @ symbol is used to split the token, while in the test case "a+b-c" we are just dropping the special char.
Please confirm the expectation from the tests.
