
Comments (6)

ktgeek avatar ktgeek commented on August 16, 2024

This question is better asked on the libRETS-users mailing list; lots of people there are experienced with all sorts of wackiness.

This is more of a general RETS question than a libRETS issue.

Keith T. Garner
[email protected] - 312-329-3294
National Association of REALTORS®
VP - Information Technology Services

On Nov 25, 2014, at 4:27 PM, Mike Sparr <[email protected]> wrote:

What is the best way to grab the full record set, and also to prevent nightmares if an MLS decides to update the modified timestamp for the entire database in a single day? (Python snippet below.)

Using Python's binding for librets, I'm iterating through the metadata resource/classes and pulling all records for each class for the initial pull. For example, I came across a market with 43K records (querying L_UpdateDate=1957-01-01T00:00:00+), but the server will only return 2,500 at a time.

    request.SetLimit(librets.SearchRequest.LIMIT_DEFAULT)  # changing these does nothing on some servers
    request.SetOffset(librets.SearchRequest.OFFSET_NONE)   # changing these does nothing on some servers
    request.SetCountType(librets.SearchRequest.RECORD_COUNT_AND_RESULTS)
    request.SetFormatType(librets.SearchRequest.COMPACT)

    # perform query
    t_start = time.strftime('%Y-%m-%d %H:%M:%S')
    results = session.Search(request)
    record_count = results.GetCount()

    print "Record count: %d" % record_count
    print
    columns = results.GetColumns()

    while results.HasNext():
        rec = {'request-id': request_id, 'data': {}}
        for column in columns:
            rec['data'][column] = results.GetString(column).decode('utf-8')  # had to fix encoding issues

As @ktgeek mentioned in other posts, all the MAY rules mean RETS server vendors don't have to guarantee cursor position or properly support limit/offset. In some markets this causes missing listings during larger pulls, or forces us to "chunk" results into many queries by last-modified date or price intervals, etc. Unfortunately we've come across some large markets that, during conversions or major changes, update the timestamp for a million+ records in a single day and wreak havoc on the last-modified approach.
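For illustration, the interval-chunking workaround described above can be sketched as a generator of DMQL2 date-range queries. This is only a sketch: the field name L_UpdateDate comes from the snippet above, and your server's accepted datetime format may differ. In DMQL2, a trailing + means greater-than-or-equal, a trailing - means less-than-or-equal, and a comma is logical AND:

```python
from datetime import datetime, timedelta

def date_window_queries(field, start, end, days=7):
    """Split [start, end] into DMQL2 date-range queries, `days` wide each.

    Adjacent windows share their boundary instant, so de-duplicate
    fetched rows by key field afterward.
    """
    fmt = "%Y-%m-%dT%H:%M:%S"
    queries = []
    cursor = start
    while cursor < end:
        upper = min(cursor + timedelta(days=days), end)
        # trailing '+' is >=, trailing '-' is <=, ',' is AND in DMQL2
        queries.append("(%s=%s+),(%s=%s-)" % (
            field, cursor.strftime(fmt), field, upper.strftime(fmt)))
        cursor = upper
    return queries
```

Each resulting query string would then be issued as its own search, so no single request exceeds the server's row cap.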



from librets.

mikesparr avatar mikesparr commented on August 16, 2024

Thanks Keith for getting back so quickly.

Using libRETS, would you first query with request.SetCountType(librets.SearchRequest.RECORD_COUNT_ONLY) to get the count, and check the response for MaxRows?

Then, using that information, would you recursively query each "chunk" with request.SetCountType(librets.SearchRequest.NO_RECORD_COUNT), setting request.SetLimit(librets.SearchRequest.LIMIT_NONE) and request.SetOffset(chunk_size * i)?

I'm curious whether the best practice is to first query the total count and use it to choose a chunking strategy; and, if limit/offset aren't honored and the modified timestamp isn't failproof, what chunking strategy is guaranteed. The latter is probably a question for the users group, but on the former I'd appreciate suggestions on the most performant way to leverage the lib.

Also, assuming limit/offset work and the query doesn't have to be used for chunking, can the request object be reused across many session.Search(request) calls while iterating through "chunks", or is there a preferred way to recursively execute the queries with the lib? If limit/offset do not work, should you create a new request object each time with a new query, or is there a way to just update the query on the existing request?

Mainly: what is the best practice for using the lib to recursively query a result set, regardless of how we solve the "chunking" issue?
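The count-then-chunk flow being asked about can be sketched as below. The offset math is plain Python; the librets calls are left as comments because they're reconstructed from the names used in this thread and should be checked against your binding. Note that RETS 1.7.2 offsets are 1-based (Offset=1 is the first record), though some servers differ:

```python
def chunk_offsets(total, chunk_size, first=1):
    """Offsets for paging `total` rows, `chunk_size` at a time.

    RETS 1.7.2 offsets are 1-based, but some servers differ,
    hence the `first` parameter.
    """
    return list(range(first, total + first, chunk_size))

# Hypothetical librets loop built on the calls quoted in this thread
# (untested sketch; verify each method against your binding):
#
# count_req = session.CreateSearchRequest(resource, klass, query)
# count_req.SetCountType(librets.SearchRequest.RECORD_COUNT_ONLY)
# total = session.Search(count_req).GetCount()
#
# for offset in chunk_offsets(total, 2500):
#     req = session.CreateSearchRequest(resource, klass, query)  # fresh request per chunk
#     req.SetCountType(librets.SearchRequest.NO_RECORD_COUNT)
#     req.SetLimit(2500)
#     req.SetOffset(offset)
#     results = session.Search(req)
#     # ... drain results.HasNext() as in the earlier snippet ...
```

Recreating the request per chunk sidesteps the reuse question entirely, at the cost of a little object churn.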


jazzklein avatar jazzklein commented on August 16, 2024

On Nov 25, 2014, at 2:46 PM, Mike Sparr [email protected] wrote:

Thanks Keith for getting back so quickly.

Using libRETS would you first query request.SetCountType(librets.SearchRequest.RECORD_COUNT_ONLY) to get count and check response for MaxRows?

Looks like today is a RETS Spec kinda day, and let’s hope your server provider is “doing the right thing” with regard to HasKeyIndex and InKeyIndex as well as the suspension of limits when a search only includes those.

First thing I’d do is do my own “chunkification” of the data and not rely upon the server because of the excess of “MAY”s in the spec. That said …

I’d do a query that only returns the keys and store them somewhere, then iterate through that list requesting the entries. You can update your list to indicate what succeeded and what didn’t, which gives you a way to restart/continue if/when things go south.

Here’s some C++ code I’ve used in the past as an example. You can get fancier and construct your secondary query as (keyfield=value)|(keyfield=value)|(keyfield=value) … so as to do bulk queries. Or just loop through as I do below:

    SearchResultSetAPtr results = session->Search(searchRequest.get());
    if (printCount)
    {
        cout << "Matching record count: " << results->GetCount() << endl;
    }

    /*
     * For all listings found, fetch the full listing detail and then the 
     * associated Photos.
     */
    while (results->HasNext())
    {
        totalListings++;
        listingIds.push_back(results->GetString(keyField));
        /*
         * Create a new search to fetch all the detail for the listing.
         */
        boost::shared_ptr<boost::thread> thd;
        thd = boost::shared_ptr<boost::thread>(new boost::thread(
                            boost::bind(
                            find_listing,
                            resource, 
                            searchClass,
                            str_stream()    << "("
                            << keyField
                            << "="
                            << results->GetString(keyField)
                            << ")")));
        threads.push_back(thd);

        thd = boost::shared_ptr<boost::thread>(new boost::thread(
                                            boost::bind(
                                            fetch_media,
                                            resource,
                                            "Photo",
                                            results->GetString(keyField))));
        threads.push_back(thd);
    }

    for (vector<boost::shared_ptr<boost::thread> >::iterator i = threads.begin(); i != threads.end(); i++)
        (*i)->join();

    cout << "Total Listings Retrieved: " << totalListings << endl;
    cout << "Listing IDs:" << endl;

    for (vector<string>::iterator i = listingIds.begin(); i != listingIds.end(); i++)
        cout << *i << endl;


mikesparr avatar mikesparr commented on August 16, 2024

I'm making sure the MAY madness is questioned in the ongoing RESO API, Data Dictionary, and Transport discussions for exactly that reason. Thanks for the tip on the technique. Some servers still appear to be RETS 1.5, so they won't support the key field check, but I'm confident we can get past that.

Using this approach, will it unnecessarily "tax" the servers with countless queries versus larger batches of results, or is this the only way to guarantee accuracy?


ktgeek avatar ktgeek commented on August 16, 2024

FBS servers do allow a “Query=*” as an approximation of a table scan. That HAS to be less taxing than repeated narrow queries. (I think it even came up back in RETS meetings before there was a thing called RESO.)

Unfortunately, the RETS vendors don’t give you a lot of choices other than the keyfield thing. For servers that support it, it’s pretty much an exception to the limits. I think the only reason those queries are slightly “better” than others is that on the backend they’re generally expected to hit an indexed unique field.

For the other servers, I’ve seen people do queries like “(City=A*)” and then rotate through the alphabet. Obviously, the field to choose varies by area.
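That rotation is easy to generate; a minimal sketch, where the field name City is just the example from above, and records whose value starts with a digit or punctuation would need extra buckets appended to the alphabet:

```python
import string

def prefix_queries(field, alphabet=string.ascii_uppercase):
    """One DMQL2 wildcard query per leading character, e.g. (City=A*)."""
    return ["(%s=%s*)" % (field, ch) for ch in alphabet]
```

For example, prefix_queries("City", string.ascii_uppercase + string.digits) also covers city names that start with a number.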

(Also, thanks for fighting the good fight! I got tired of that fight 7 years ago.)


Keith T. Garner - [email protected] - 312-329-3294 - http://realtor.org/
National Association of REALTORS® - VP - Information Technology Services


jazzklein avatar jazzklein commented on August 16, 2024

On Nov 25, 2014, at 3:24 PM, Mike Sparr [email protected] wrote:

Using this approach will it unnecessarily "tax" the servers with countless queries versus larger batches of results or is this the only way to guarantee accuracy?

It will be poorer performing. If you don’t mind a little code, my preference would be to do a compound query. You’d need to figure out the best mix of memory consumed vs. hard limits (if the server uses one) vs. restart-ability if something goes wrong.
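A sketch of that compound-query idea, using the (keyfield=value)|(keyfield=value) form shown earlier in the thread ('|' is OR in DMQL2). Batch size and the checkpointing of finished batches are exactly the knobs mentioned above; the field name passed in is whatever your metadata reports as the key field:

```python
def key_batches(keys, batch_size=100):
    """Split the stored key list into batches; checkpoint each finished
    batch (e.g. to disk) so a failed pull can resume where it left off."""
    for i in range(0, len(keys), batch_size):
        yield keys[i:i + batch_size]

def compound_query(field, keys):
    """OR the keys together: (field=k1)|(field=k2)|... ('|' is OR in DMQL2)."""
    return "|".join("(%s=%s)" % (field, k) for k in keys)
```

Each batch becomes one search, trading request count against per-query size and memory.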

