Comments (13)
thanks for reporting this.
I just fixed it, could you try again?
from pydepta.
Thank you!
Now I get something, but the output seems to be different from the example:
from flask import Flask, request, render_template
from pydepta import Depta

app = Flask(__name__)

@app.route('/')
def pydepta():
    url = request.args.get('url')
    print url
    if url:
        depta = Depta()
        regions = depta.extract(url='http://www.iens.nl/restaurant/10545/enschede-rhodos')
        a_region = depta.infer(regions[8], url='http://www.iens.nl/restaurant/34397/apeldoorn-de-boschvijver')
        regions = a_region
        tables = [[i, region.as_html_table().decode('utf-8')] for i, region in enumerate(regions)]
        return render_template('tables.html', tables=tables)
    else:
        return render_template('index.html')

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5444, debug=True)
It produces this... Is this correct? It doesn't look like the one in the example.
Also, it seems to place rows side by side. How can I make it put one row on each line?
Thanks again!
from pydepta.
Also, I don't understand what infer() is supposed to do. Does it take the diff? Does it figure out the data fields?
from pydepta.
While extract() works well, infer() seems to behave even more erratically. For instance, on some sites where extract() works, infer() returns no tables at all, or only a handful of rows.
from flask import Flask, request, render_template
from pydepta import Depta

app = Flask(__name__)

@app.route('/')
def pydepta():
    url = request.args.get('url')
    print url
    if url:
        depta = Depta()
        regions = depta.extract(url='http://www.amazon.ca/s/ref=lp_916520_nr_n_0?rh=n%3A916520%2Cn%3A%21927726%2Cn%3A933484&bbn=927726&ie=UTF8&qid=1386501729&rnid=927726')
        a_region = depta.infer(regions[16], url='http://www.amazon.ca/s/ref=lp_933484_pg_2?rh=n%3A916520%2Cn%3A%21927726%2Cn%3A933484&page=2&ie=UTF8&qid=1386501736')
        regions = a_region
        tables = [[i, region.as_html_table().decode('utf-8')] for i, region in enumerate(regions)]
        return render_template('tables.html', tables=tables)
    else:
        return render_template('index.html')

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5444, debug=True)
This produces an output like.
from pydepta.
Hi,
It seems DEPTA treats every 2 items as a group (their similarity is >= the default threshold, so it can find a larger data record). That's why the output differs from the example and has 2 items in 1 row.
(In reply to yuyuyaya's comment of Dec 8, 2013, 7:11 PM.)
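To make the "every 2 items as a group" remark concrete, here is a toy sketch. It is not pydepta's implementation: `difflib.SequenceMatcher` stands in for tree matching, and the names `similarity`, `group_sizes_passing` and the sample tag lists are hypothetical. It only shows that, when items alternate between two slightly different layouts, groups of two can be more similar to each other than single items are. Which passing group size a real detector finally picks is a separate policy that this sketch does not model.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Ratio in [0, 1] between two token lists (a stand-in for tree matching)."""
    return SequenceMatcher(None, a, b).ratio()


def group_sizes_passing(items, threshold, max_size=3):
    """Return every group size k whose adjacent k-item groups all clear `threshold`."""
    passing = []
    for k in range(1, max_size + 1):
        # Cut the items into consecutive, non-overlapping groups of k items,
        # flattening each group into a single token list.
        groups = [sum(items[i:i + k], []) for i in range(0, len(items) - k + 1, k)]
        if len(groups) < 2:
            continue  # a single group has nothing to be compared against
        if all(similarity(x, y) >= threshold for x, y in zip(groups, groups[1:])):
            passing.append(k)
    return passing


# Hypothetical listing whose items alternate between two layouts.
odd = ["div", "img", "a", "span"]
even = ["div", "img", "a", "p"]
items = [odd, even, odd, even]

print(group_sizes_passing(items, threshold=0.7))  # single items and pairs both pass
print(group_sizes_passing(items, threshold=0.9))  # only pairs pass
```

In this made-up data, single items are 0.75 similar to their neighbours while pairs are identical, so pairs clear any threshold that single items clear.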
from pydepta.
infer() is supposed to find the data records on similar pages (similar to the page the seed was extracted from), even if there is only 1 data record.
(DEPTA assumes the page has at least 2 data records, otherwise the similarity comparison won't work, so infer() is intended to get around this limit.)
(In reply to yuyuyaya's comment of Dec 8, 2013, 7:12 PM.)
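The difference between the two approaches can be sketched with a small toy example. This is an illustration of the idea only, not pydepta's code: `find_repeats` and `match_seed` are hypothetical helpers, `difflib.SequenceMatcher` replaces tree matching, and the 0.75 threshold is an arbitrary value chosen for the sketch. Repetition-based detection needs at least two similar records on the same page, while seed-based matching can still find a lone record.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Ratio in [0, 1] between two token lists (a stand-in for tree matching)."""
    return SequenceMatcher(None, a, b).ratio()


def find_repeats(records, threshold=0.75):
    """Return records that resemble an adjacent sibling on the same page."""
    found = []
    for i, rec in enumerate(records):
        # Previous and next sibling, where they exist.
        neighbours = records[max(i - 1, 0):i] + records[i + 1:i + 2]
        if any(similarity(rec, n) >= threshold for n in neighbours):
            found.append(rec)
    return found


def match_seed(seed, records, threshold=0.75):
    """Return records that resemble a seed record taken from another page."""
    return [rec for rec in records if similarity(seed, rec) >= threshold]


# Hypothetical pages, each child reduced to a list of tag names.
listing_page = [["li", "a", "img", "span"],
                ["li", "a", "img", "span"],
                ["li", "a", "img", "b"]]
single_page = [["nav", "ul"],
               ["li", "a", "img", "span"],
               ["footer", "p"]]

seed = find_repeats(listing_page)[0]

print(len(find_repeats(listing_page)))  # all three records resemble a neighbour
print(find_repeats(single_page))        # nothing repeats, so nothing is found
print(match_seed(seed, single_page))    # the seed still locates the lone record
```

The sketch also hints at why two dissimilar pages defeat this approach: if nothing on the second page clears the threshold against the seed, the result is empty.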
from pydepta.
It seems these 2 pages are not similar; that's why infer() doesn't work.
(In reply to yuyuyaya's comment of Dec 8, 2013, 7:31 PM.)
from pydepta.
hi tpeng!
Thanks for the explanation.
Is it possible to change the default threshold so that each row is on its own line?
So one should use infer() for a non-MDR (multiple data record) page and extract() for an MDR page?
Thanks again!
from pydepta.
- Yes. You can create a Depta instance with the threshold set to another value.
- Yes.
from pydepta.
- How can I do this? Is there a list of arguments and methods? There is very little documentation.
from pydepta.
e.g.
from pydepta import Depta
d = Depta(threshold=0.9)
I agree there is very little documentation; I can probably add some later.
from pydepta.
I am still having trouble with infer().
Consider the following code. It takes an Amazon product detail page and returns nothing. I made sure I am using the right table index (I'm trying to get the ISBN of the book), which is the 12th table,
but on the other URL the ISBN is actually in the 11th table.
Is there a way to resolve this issue? Both pages look the same.
depta = Depta(threshold=0.9)
regions = depta.extract(url='http://www.amazon.ca/Flood-2013-Summer-Southern-Alberta/dp/1771640308/ref=sr_1_17?s=books&ie=UTF8&qid=1386574042&sr=1-17')
a_region = depta.infer(regions[12], url='http://www.amazon.ca/Earth-Spirit-Place-Featuring-Photographs/dp/1894673670/ref=sr_1_18?s=books&ie=UTF8&qid=1386574042&sr=1-18')
regions = a_region
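Since the ISBN table sits at a different index on each page, a hard-coded index is fragile. One possible workaround, sketched below, is to run extract() on each page separately and pick the table by its content instead of its position. This does not fix infer(); `find_table_index` is a hypothetical helper, and the HTML strings are made-up stand-ins for what `region.as_html_table()` might return.

```python
def find_table_index(tables_html, keyword):
    """Return the index of the first HTML table string containing `keyword`, or None."""
    for i, html in enumerate(tables_html):
        if keyword.lower() in html.lower():
            return i
    return None


# Made-up stand-ins for the tables extracted from two product pages.
page_a = ["<table><tr><td>Reviews</td></tr></table>",
          "<table><tr><td>ISBN-10: XXXXXXXXXX</td></tr></table>"]
page_b = ["<table><tr><td>ISBN-10: XXXXXXXXXX</td></tr></table>",
          "<table><tr><td>Reviews</td></tr></table>"]

print(find_table_index(page_a, "ISBN"))
print(find_table_index(page_b, "ISBN"))
print(find_table_index(page_a, "Publisher"))
```

The same keyword locates the table on both pages even though its position differs, and a missing keyword yields None rather than a wrong table.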
from pydepta.
Hi @yuyuyaya,
I'm working on a new infer. It will use Scrapely for extracting structured data. You can find the changes at https://github.com/tpeng/pydepta/tree/infer-with-scrapely.
It's still a work in progress (WIP) and it also needs some patches to Scrapely, but hopefully I can finish it soon.
Stay tuned!
Thanks
from pydepta.