Comments (7)

NightOwl888 commented on July 29, 2024

There are 2 different issues here.

First of all, it says specifically in the documentation that SimpleFacetHandler requires its field to be non-tokenized.

Second, from my understanding facet handlers are designed to be used internally by the BoboBrowser, not to be called directly. Nearly every use case with BoboBrowse calls BoboBrowser.Browse(). I am not sure exactly what your use case is here, but if you are trying to read back field data from the index, you would need to use Lucene for that. I found this article that demonstrates how to do it.

// Open index.
using (Lucene.Net.Store.Directory idx = FSDirectory.Open(new System.IO.DirectoryInfo(indexPath)))
{
    using (IndexReader reader = IndexReader.Open(idx, true))
    {
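        // Seek to the first term of the "text" field; stop once the enumeration moves on to another field.
        TermEnum terms = reader.Terms(new Term("text"));
        for (Term term = terms.Term; term != null && term.Field == "text"; terms.Next(), term = terms.Term)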
        {
            var value = term.Text;
            Console.WriteLine(value);
        }
    }
}

from bobobrowse.net.

dtosato commented on July 29, 2024

I am testing the fast field retrieval functionality.
From Lucene in Action Second Ed.:

Bobo Browse can retrieve the field values for a specific document ID and field name.
With Lucene, a Field with the Store.YES attribute turned on can be stored:
doc.add(new Field("color","red",Store.YES,Index.NOT_ANALYZED_NO_NORMS));
then retrieved via the API:
Document doc = indexReader.document(docid);
String colorVal = doc.get("color");
We devised a test to understand performance better. We created an optimized index
of 1 million documents, each with one field named color with a value to be one of
eight color strings. This was a very optimized scenario for retrieval of stored data
because there was only one segment and much of the data could fit in memory. We
then iterated through the index and retrieved the field value for color. This took
1,250 milliseconds (ms).
Next, we did the same thing, but instead of creating a BoboIndexReader with a
FacetHandler, we built on the indexed data of the color field. We paid a penalty of 133
ms to load the FacetHandler once the index loads, and retrieval time took 41 ms. By
paying a 10 percent penalty once, we boosted the retrieval speed over 3,000 percent.

This functionality worked in @zhengchun's port, regardless of whether the field was analyzed.

NightOwl888 commented on July 29, 2024

The example from that chapter functions as expected:

public void Test()
{
    Stopwatch stopWatch = new Stopwatch();

    // Create index.
    string indexPath = @"c:\temp\test";
    int numDocs = 10;
    using (Lucene.Net.Store.Directory directory = FSDirectory.Open(new System.IO.DirectoryInfo(indexPath)))
    {
        string text;
        using (IndexWriter modifier = new IndexWriter(directory, new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30), true, Lucene.Net.Index.IndexWriter.MaxFieldLength.UNLIMITED))
        {
            for (int i = 0; i < numDocs; i++)
            {
                text = "asisdfowefposjverovn";
                Document doc = new Document();
                doc.Add(new Field("text", text, Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS));
                modifier.AddDocument(doc);
            }
        }
    }

    // Facets.
    IFacetHandler textFacetHandler = new SimpleFacetHandler("text");
    ICollection<IFacetHandler> handlerList = new IFacetHandler[] { textFacetHandler };

    // Open index.
    using (Lucene.Net.Store.Directory idx = FSDirectory.Open(new System.IO.DirectoryInfo(indexPath)))
    {
        using (IndexReader reader = IndexReader.Open(idx, true))
        {
            // Open bobo reader.
            using (BoboIndexReader boboReader = BoboIndexReader.GetInstance(reader, handlerList))
            {
                // Extract text.
                int n = reader.NumDocs();
                for (int i = 0; i < n; i++)
                {
                    Document doc = boboReader.Document(i);
                    string field = doc.Get("text");
                    Assert.IsFalse(string.IsNullOrEmpty(field));
                }
            }
        }
    }
}

I looked through the book, but at no point can I find an example where they are calling the facet handler directly.

However, I spotted the problem with your code. The 3.2.0 BoboIndexReader design has changed in that it now has a collection of subreaders nested within the parent reader. On a single browse request, there is just 1 subreader at index 0. So, you could change your code to:

for (int i = 0; i < n; i++)
{
    string field = textFacetHandler.GetFieldValue(boboReader.SubReaders[0], i);
    Assert.IsFalse(string.IsNullOrEmpty(field));
}

Still, I would recommend calling the Document() method as specified in the book because it will shield you from any future design changes or problems arising from multi-browse requests that have more than one subreader.

dtosato commented on July 29, 2024

I usually have more than one subreader because my Lucene index has many segments.
Could you implement a solution that handles this case? Or is it already implemented but not working?
Do you know whether the Document() method can deliver the performance gain described in Lucene in Action, Second Edition?

NightOwl888 commented on July 29, 2024

Oh, I just noticed that the SubReaders property is internal, so it cannot be used this way. That was a design decision in Java that was copied over into .NET.

The link you provided is to the original Java source of BoboBrowse 3.2.0, and it corresponds to the GetFieldValue method in this repository. As you can see, it is implemented exactly the same way. But it requires the BoboIndexReader to be a subreader in order to function.

If you have a look at the BoboIndexReader.Document(int) method, you can see how it handles the subreader logic: if the reader itself is a subreader, it calls GetFieldValues, which must be overridden in the FacetHandler. The exact implementation depends on the facet handler. SimpleFacetHandler looks like this:

public override string[] GetFieldValues(BoboIndexReader reader, int id)
{
    FacetDataCache dataCache = GetFacetData<FacetDataCache>(reader);
    if (dataCache != null)
    {
        return new string[] { dataCache.ValArray.Get(dataCache.OrderArray.Get(id)) };
    }
    return new string[0];
}
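
For an index with many segments, the subreader logic comes down to translating a top-level document ID into a subreader index plus a segment-local ID. The following is a minimal sketch of that translation in the style Lucene's multi-segment readers use (an array of segment start offsets plus a binary search). It is not the BoboIndexReader code; `SegmentResolver`, `BuildStarts`, and `FindSegment` are hypothetical names introduced only for illustration.

```csharp
using System;

// Hypothetical helper; not part of BoboBrowse or Lucene.Net.
public static class SegmentResolver
{
    // starts[i] is the first top-level doc ID of segment i;
    // the final entry holds the total number of documents.
    public static int[] BuildStarts(int[] maxDocs)
    {
        int[] starts = new int[maxDocs.Length + 1];
        for (int i = 0; i < maxDocs.Length; i++)
        {
            starts[i + 1] = starts[i] + maxDocs[i];
        }
        return starts;
    }

    // Returns the largest segment index whose start offset is <= docId.
    // Picking the largest such index skips over empty segments.
    public static int FindSegment(int[] starts, int docId)
    {
        if (docId < 0 || docId >= starts[starts.Length - 1])
        {
            throw new ArgumentOutOfRangeException("docId");
        }
        int lo = 0;
        int hi = starts.Length - 2;
        while (lo < hi)
        {
            int mid = (lo + hi + 1) / 2; // round up so the loop always makes progress
            if (starts[mid] <= docId)
            {
                lo = mid;
            }
            else
            {
                hi = mid - 1;
            }
        }
        return lo;
    }
}
```

With segments of 3, 2, and 4 documents, `BuildStarts` yields the offsets 0, 3, 5, 9, so top-level ID 4 resolves to segment 1 with local ID `4 - starts[1]`, which is 1. The local ID is what a per-segment call such as GetFieldValues would receive.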

As for performance gains over Lucene.Net, you will need to benchmark it for yourself with a similar sample size of 1 million documents and 8 evenly distributed values. There might not be as much of a difference, because of differences in the way that .NET and Java manage memory and data types. But it also depends on how Lucene.Net optimized the field retrieval compared to its Java counterpart.

dtosato commented on July 29, 2024

I ran the test and found that Bobo is much slower than Lucene. So using Document() is not equivalent to GetFieldValue().
For now I consider fast field retrieval unavailable. This is not a big deal for me, but let me know if you want to improve this aspect of your solution.

NightOwl888 commented on July 29, 2024

Keep in mind, Lucene in Action, Second Edition was published in 2010. BoboBrowse 3.2.0 was released on March 23, 2013. So, much of what is in Lucene in Action is out of date with respect to BoboBrowse. BoboBrowse 2.5 was released in mid-2010, so it is difficult to tell whether they are basing those results on 2.0 or 2.5.

That said, it is not that surprising that the results are different. I ran a few experiments:

  1. I changed the PredefinedTermListFactory to use a switch case statement instead of Reflection to create the TermXXXLists.
  2. I ran a test to see what would happen if I changed the type of the BoboIndexReader _facetDataMap member variable from IDictionary<string, object> to IDictionary<string, Interface>.
  3. I compared the performance of using a local variable to store an object (as was done in 2.0) with the performance of storing the object in an IDictionary<string, object> in a separate object.

The first two experiments gave no significant performance improvement. The third showed that the way the data cache was stored in 2.0 is about 18x faster to access than the way it is done in 3.2.0.
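
A rough sketch of the kind of micro-benchmark the third experiment describes, comparing a direct reference with a lookup in an IDictionary<string, object>, plus a locked variant to approximate a thread-safe wrapper. This is not the code that was actually run; `CacheAccessBenchmark` and its members are hypothetical names, and the JIT may optimize the direct loop heavily, so treat any ratio it prints as indicative only.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical micro-benchmark; none of these names come from BoboBrowse.
public static class CacheAccessBenchmark
{
    // Reads the cached object through a plain reference.
    public static int CountDirect(object cache, int iterations)
    {
        int hits = 0;
        for (int i = 0; i < iterations; i++)
        {
            if (cache != null) hits++;
        }
        return hits;
    }

    // Looks the cached object up by field name on every access.
    public static int CountViaDictionary(IDictionary<string, object> map, string key, int iterations)
    {
        int hits = 0;
        for (int i = 0; i < iterations; i++)
        {
            object cache;
            if (map.TryGetValue(key, out cache) && cache != null) hits++;
        }
        return hits;
    }

    // Same lookup, but guarded by a lock to mimic a thread-safe wrapper.
    public static int CountViaLockedDictionary(IDictionary<string, object> map, object gate, string key, int iterations)
    {
        int hits = 0;
        for (int i = 0; i < iterations; i++)
        {
            object cache;
            lock (gate)
            {
                map.TryGetValue(key, out cache);
            }
            if (cache != null) hits++;
        }
        return hits;
    }

    // Measures the wall-clock time of a single run of the action.
    public static TimeSpan Time(Action action)
    {
        Stopwatch sw = Stopwatch.StartNew();
        action();
        sw.Stop();
        return sw.Elapsed;
    }

    public static void Main()
    {
        const int iterations = 10000000;
        object cache = new object();
        var map = new Dictionary<string, object> { { "color", cache } };
        object gate = new object();

        Console.WriteLine("direct:     " + Time(() => CountDirect(cache, iterations)));
        Console.WriteLine("dictionary: " + Time(() => CountViaDictionary(map, "color", iterations)));
        Console.WriteLine("locked:     " + Time(() => CountViaLockedDictionary(map, gate, "color", iterations)));
    }
}
```

Running each variant a few times and discarding the first run helps keep JIT warm-up out of the comparison.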

I also saw that just a single browse operation (the BoboTestCase.TestDate test) calls the method to access the data cache several hundred times (308 in the test), so optimizing it could produce a significant impact on performance. Unfortunately, I was unable to find a method that works faster than the way it is currently done.

Clearly there were some trade-offs in the 3.0 design. Access to the centrally located data cache is slower, but putting the cache data in a central location is necessary to support the custom sorting functionality. The dictionary the cache is stored in is also put into a thread-safe wrapper, which again slows things down a bit in exchange for thread safety.

However, performance still seems more than adequate for basic browse scenarios. I haven't run any of the unit tests for BoboBrowse 3.2.0 in Java, which would be a true apples-to-apples comparison, but based on my experiments it looks like it is about as optimized as it can be. I don't see that there is much of a chance to get it running faster than it does without significantly changing the design or compromising thread safety.

But let me know if you find any bottlenecks that could be improved upon.
