dalab / web2text
Source code for the paper "Web2Text: Deep Structured Boilerplate Removal", full paper @ ECIR'18
License: MIT License
It appears the future module (https://pypi.org/project/future/) is missing from the documented dependencies; I guess there is no requirements file, or the docs are incomplete.
Command:
$ python src/main/python/main.py classify result/step_1_extracted_features result/step_2_classified_labels
Error:
Traceback (most recent call last):
File "src/main/python/main.py", line 7, in <module>
from forward import EDGE_VARIABLES, UNARY_VARIABLES, edge, loss, unary
File "/home/lawrence/web2text/src/main/python/forward.py", line 4, in <module>
from config import Config
File "/home/lawrence/web2text/src/main/python/config.py", line 5, in <module>
from future.utils import iteritems
ModuleNotFoundError: No module named 'future'
Resolution:
$ pip install future
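Since there is no requirements file to consult, a quick pre-flight check can report every missing module at once instead of failing on the first import. This is a minimal sketch: the module list is taken only from the reports on this page (future, numpy, tensorflow), and `REQUIRED_MODULES` and `missing_modules` are hypothetical names, not part of the repository.

```python
import importlib.util

# Assumption: modules mentioned in the issue reports on this page.
REQUIRED_MODULES = ["future", "numpy", "tensorflow"]


def missing_modules(names):
    """Return the top-level module names that cannot be found on this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules, try: pip install " + " ".join(missing))
    else:
        print("All listed modules are importable.")
```

Running it before `main.py` tells you which packages to install in one go.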
Would you mind adding a recipe to show the whole workflow from a raw HTML page to a cleaned text file?
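As a partial answer, the two steps whose commands appear in other reports on this page (feature extraction with sbt, then classification with `main.py`) can be chained from a small driver. This is only a sketch under those assumptions: `pipeline_commands` and `run_pipeline` are hypothetical helpers, the file arguments are placeholders, and the final step that turns labels into cleaned text is not shown on this page, so it is left out.

```python
import subprocess


def pipeline_commands(html_path, features_base, labels_path):
    """Build the command lines for the two steps quoted in the issue reports."""
    extract = [
        "sbt",
        # sbt expects the whole runMain invocation as a single argument.
        f"runMain ch.ethz.dalab.web2text.ExtractPageFeatures {html_path} {features_base}",
    ]
    classify = [
        "python",
        "src/main/python/main.py",
        "classify",
        features_base,
        labels_path,
    ]
    return [extract, classify]


def run_pipeline(html_path, features_base, labels_path):
    """Run each step in order from the repository root, stopping on the first failure."""
    for cmd in pipeline_commands(html_path, features_base, labels_path):
        subprocess.run(cmd, check=True)
```

For example, `run_pipeline("index.html", "output", "labels")` would run both steps in sequence, assuming sbt and the Python dependencies are already installed.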
Hello, I've been trying to get the base recipe running for a while and have found no success. The following happens when I enter the command run:
[info] Compiling 37 Scala sources to /home/jrojas/Projects/Extractors/web2text/target/scala-2.10/classes ...
[error] /home/jrojas/Projects/Extractors/web2text/src/main/scala/ch/ethz/dalab/web2text/cdom/Node.scala:39:33: missing parameter type
[error] val c = for (c <- children; l <- c.toString.lines) yield {" " + l}
[error] ^
[error] /home/jrojas/Projects/Extractors/web2text/src/main/scala/ch/ethz/dalab/web2text/cleaneval/CleanEval.scala:140:54: value drop is not a member of java.util.stream.Stream[String]
[error] val contents = if (f.startsWith("URL:")) f.lines.drop(1).mkString("\n")
[error] ^
[error] /home/jrojas/Projects/Extractors/web2text/src/main/scala/ch/ethz/dalab/web2text/features/PageFeatures.scala:29:63: type mismatch;
[error] found : java.util.stream.Stream[String]
[error] required: Iterator[?]
[error] (blockFeatureLabels.toIterator zip blockFeatures.toString.lines)
[error] ^
[error] three errors found
[error] (Compile / compileIncremental) Compilation failed
[error] Total time: 5 s, completed Nov 12, 2019, 12:42:05 AM
System information:
OS: Ubuntu 18.04
SBT Version: 1.3.3
Scala Version: 2.10.4
Also tested with:
SBT Version: 0.13.7
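The `java.util.stream.Stream[String]` in these errors suggests the build is running on JDK 11 or newer, where `java.lang.String` gained its own `lines()` method that takes precedence over the Scala one. Building with JDK 8 is one way around it; another is to call Scala's `linesIterator` instead. The sketch below is a hypothetical helper (`use_lines_iterator` is not part of the repository) that performs that textual replacement on a Scala source string; review the resulting diff before committing, since a plain regex cannot tell strings from other types.

```python
import re

# Match `.lines` only as a complete method name, so `.linesIterator`
# and `.linesWithSeparators` are left untouched.
_LINES_CALL = re.compile(r"\.lines\b")


def use_lines_iterator(scala_source: str) -> str:
    """Replace every `.lines` call in the given Scala source with `.linesIterator`."""
    return _LINES_CALL.sub(".linesIterator", scala_source)


if __name__ == "__main__":
    sample = 'val contents = if (f.startsWith("URL:")) f.lines.drop(1).mkString("\\n")'
    print(use_lines_iterator(sample))
```

The replacement is idempotent, so running it twice over the same file does no further harm.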
$ python src/main/python/main.py classify result/step_1_extracted_features result/step_2_classified_labels
2021-10-26 21:30:40.002411: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
Traceback (most recent call last):
File "src/main/python/main.py", line 7, in <module>
from forward import EDGE_VARIABLES, UNARY_VARIABLES, edge, loss, unary
File "/home/ubuntu/sandbox/alan134/web2text/src/main/python/forward.py", line 2, in <module>
from tensorflow import variable_scope, convert_to_tensor
ImportError: cannot import name 'variable_scope' from 'tensorflow' (/home/ubuntu/anaconda3/envs/alan-env/lib/python3.8/site-packages/tensorflow/__init__.py)
Looking up "variable_scope", I find this: https://www.tensorflow.org/api_docs/python/tf/compat/v1/variable_scope.
It seems to be saying that this (and apparently a number of other methods) is deprecated and needs to be replaced with something like tf.compat.v1.variable_scope.
I would do this myself but I am not well versed in TF. Any chance of getting this updated to work with TF2?
Thanks,
Alan
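One commonly used stopgap, short of a real TF2 port, is to import the TF1 compatibility module and take the old names from there. The following is a minimal sketch, assuming TensorFlow 2.x is installed and that the rest of the graph-style code runs under `tf.compat.v1`; it has not been tested against this repository, and the `HAVE_TF` flag is only there so the snippet degrades gracefully when TensorFlow is absent.

```python
# Sketch of a compatibility import for TF1-style code on TensorFlow 2.x.
try:
    import tensorflow.compat.v1 as tf

    # Switch off eager execution and other TF2 defaults for graph-style code.
    tf.disable_v2_behavior()

    # Names that forward.py imports directly from `tensorflow` in the traceback.
    variable_scope = tf.variable_scope
    convert_to_tensor = tf.convert_to_tensor
    HAVE_TF = True
except ImportError:
    # TensorFlow (or its compat.v1 module) is not available on this interpreter.
    HAVE_TF = False

if __name__ == "__main__":
    print("TF1 compatibility layer available:", HAVE_TF)
```

Other TF1-only symbols used elsewhere in the code would likely need the same treatment.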
I want to extract the CleanEval data. The README explains it like this:
import ch.ethz.dalab.web2text.utilities.Util
import ch.ethz.dalab.web2text.cleaneval.CleanEval
import ch.ethz.dalab.web2text.output.CsvDatasetWriter
val data = Util.time{ CleanEval.dataset(fe) }
// Write block_features.csv and edge_features.csv
// Format of a row: page id, groundtruth label (1/0), features ...
CsvDatasetWriter.write(data, "./src/main/python/data")
// Print the names of the exported features in order
println("# Block features")
fe.blockExtractor.labels.foreach(println)
println("# Edge features")
fe.edgeExtractor.labels.foreach(println)
but I don't understand what "fe" is. Could you explain how to define "fe"?
Hello,
I am trying to benchmark some models against your train/test splits of the cleaneval data (thank you for releasing it as it is not available online anymore).
However, you did not release your "data/cleaneval.npy" file, which is necessary to use your src/main/python/data.py script.
The train/test splits are not replicable otherwise.
Thank you in advance.
I've tested it on several websites so far, and it only grabs tiny excerpts of what it thinks is the main content. The extracted text does come from the main content, but the rest of the main content is ignored.
I've used the recipe to generate the final output text. How can I tweak this so that it can grab the expected main content text?
By default, is it using pre-trained weights? How can I "teach" it so that its accuracy will improve?
So far I tested:
https://news.ycombinator = grabs only the first submission
https://openai.com/blog/openai-pytorch/ = " In the past, we implemented projects in many frameworks depending on their relative strengths. We’ve now chosen to standardize to make it easier for our team to create and share optimized implementations of our models." missing the first sentence and the rest of the text.
So I've been able to successfully extract features; it produces the output_.....csv
Then I went into the src/main/python directory and installed numpy, tensorflow, and future with pip3.
Then, from the root directory, I ran python3 src/main/python/main.py classify output labelz
asdf@ubuntu-s-1vcpu-1gb-sfo2-01:~/web2text$ python3 src/main/python/main.py classify output labelz
/home/asdf/.local/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:516: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/asdf/.local/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:517: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/asdf/.local/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:518: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/asdf/.local/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:519: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/asdf/.local/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:520: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/asdf/.local/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
/home/asdf/.local/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/asdf/.local/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/asdf/.local/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/asdf/.local/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/asdf/.local/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/asdf/.local/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
WARNING:tensorflow:From /home/asdf/web2text/src/main/python/forward.py:2: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.
Traceback (most recent call last):
File "src/main/python/main.py", line 313, in <module>
main()
File "src/main/python/main.py", line 40, in main
classify(file_base + '_block_features.csv', file_base + '_edge_features.csv', labels_output_file)
File "src/main/python/main.py", line 254, in classify
unary_logits = unary(unary_features, is_training=False)
File "/home/asdf/web2text/src/main/python/forward.py", line 19, in unary
c = Config()
File "/home/asdf/web2text/src/main/python/config.py", line 14, in __init__
root[k] = FLAGS.__getattr__(k)
File "/home/asdf/.local/lib/python3.6/site-packages/absl/flags/_flagvalues.py", line 491, in __getattr__
raise _exceptions.UnparsedFlagAccessError(error_message)
absl.flags._exceptions.UnparsedFlagAccessError: Trying to access flag --logtostderr before flags were parsed.
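This error means absl flag values were read before the flags had been parsed. A possible workaround is to parse them explicitly before building the config and to read values through the public `flag_values_dict()` method rather than private attributes. This is a sketch under the assumption that absl is the flags backend in use (as the traceback suggests); `parsed_flag_values` is a hypothetical helper, not code from the repository.

```python
import sys

try:
    from absl import flags
except ImportError:
    # absl is not installed; the helper below then returns an empty dict.
    flags = None


def parsed_flag_values(argv=None):
    """Return {flag name: value}, parsing the flags first if that has not happened yet."""
    if flags is None:
        return {}
    FLAGS = flags.FLAGS
    if not FLAGS.is_parsed():
        # known_only=True leaves unrecognised arguments (e.g. `classify output labelz`) alone.
        FLAGS(argv if argv is not None else sys.argv, known_only=True)
    return FLAGS.flag_values_dict()


if __name__ == "__main__":
    for name, value in sorted(parsed_flag_values().items()):
        print(name, "=", value)
```

A config class could then iterate over the returned dictionary instead of calling `FLAGS.__getattr__` on unparsed flags.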
Hi,
I am trying to train the model with a different set of data. I am following through your steps.
I have 2 questions
source as the HTML doc and the cleaned txt for cleaned html files inside the input folder and dump them into one extract block and edge features csv file
I am now getting the error message at step of the recipe:
tensorflow.python.framework.errors_impl.NotFoundError: trained_model_cleaneval_split; No such file or directory
using python3 src/main/python/main.py classify output labelz
I did these steps:
python3 main.py train_unary
But the fourth step failed with this error:
File "main.py", line 262, in <module>
main()
File "main.py", line 28, in main
train_unary()
File "main.py", line 83, in train_unary
dropout_keep_prob=DROPOUT_KEEP_PROB)
File "/root/work/web2text/src/main/python/forward.py", line 19, in unary
c = Config()
File "/root/work/web2text/src/main/python/config.py", line 12, in __init__
for k, v in iteritems(FLAGS.__dict__['__flags']):
KeyError: '__flags'
Could you tell me how to fix it?
My tensorflow version is 1.10.0
When I import this project using IDEA, the IDE shows the following error message:
[error] sbt.librarymanagement.ResolveException: unresolved dependency: ch.ethz.dalab#dissolvestruct_2.10;0.1-SNAPSHOT: not found
[error] at sbt.internal.librarymanagement.IvyActions$.resolveAndRetrieve(IvyActions.scala:331)
[error] at sbt.internal.librarymanagement.IvyActions$.$anonfun$updateEither$1(IvyActions.scala:205)
[error] at sbt.internal.librarymanagement.IvySbt$Module.$anonfun$withModule$1(Ivy.scala:229)
[error] at sbt.internal.librarymanagement.IvySbt.$anonfun$withIvy$1(Ivy.scala:190)
[error] at sbt.internal.librarymanagement.IvySbt.sbt$internal$librarymanagement$IvySbt$$action$1(Ivy.scala:70)
[error] at sbt.internal.librarymanagement.IvySbt$$anon$3.call(Ivy.scala:77)
[error] at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:93)
[error] at xsbt.boot.Locks$GlobalLock.xsbt$boot$Locks$GlobalLock$$withChannelRetries$1(Locks.scala:78)
[error] at xsbt.boot.Locks$GlobalLock$$anonfun$withFileLock$1.apply(Locks.scala:97)
[error] at xsbt.boot.Using$.withResource(Using.scala:10)
[error] at xsbt.boot.Using$.apply(Using.scala:9)
[error] at xsbt.boot.Locks$GlobalLock.ignoringDeadlockAvoided(Locks.scala:58)
[error] at xsbt.boot.Locks$GlobalLock.withLock(Locks.scala:48)
[error] at xsbt.boot.Locks$.apply0(Locks.scala:31)
[error] at xsbt.boot.Locks$.apply(Locks.scala:28)
[error] at sbt.internal.librarymanagement.IvySbt.withDefaultLogger(Ivy.scala:77)
[error] at sbt.internal.librarymanagement.IvySbt.withIvy(Ivy.scala:185)
[error] at sbt.internal.librarymanagement.IvySbt.withIvy(Ivy.scala:182)
[error] at sbt.internal.librarymanagement.IvySbt$Module.withModule(Ivy.scala:228)
[error] at sbt.internal.librarymanagement.IvyActions$.updateEither(IvyActions.scala:190)
[error] at sbt.librarymanagement.ivy.IvyDependencyResolution.update(IvyDependencyResolution.scala:20)
[error] at sbt.librarymanagement.DependencyResolution.update(DependencyResolution.scala:56)
[error] at sbt.internal.LibraryManagement$.resolve$1(LibraryManagement.scala:38)
[error] at sbt.internal.LibraryManagement$.$anonfun$cachedUpdate$12(LibraryManagement.scala:91)
[error] at sbt.util.Tracked$.$anonfun$lastOutput$1(Tracked.scala:68)
[error] at sbt.internal.LibraryManagement$.$anonfun$cachedUpdate$19(LibraryManagement.scala:104)
[error] at scala.util.control.Exception$Catch.apply(Exception.scala:224)
[error] at sbt.internal.LibraryManagement$.$anonfun$cachedUpdate$11(LibraryManagement.scala:104)
[error] at sbt.internal.LibraryManagement$.$anonfun$cachedUpdate$11$adapted(LibraryManagement.scala:87)
[error] at sbt.util.Tracked$.$anonfun$inputChanged$1(Tracked.scala:149)
[error] at sbt.internal.LibraryManagement$.cachedUpdate(LibraryManagement.scala:118)
[error] at sbt.Classpaths$.$anonfun$updateTask$5(Defaults.scala:2353)
[error] at scala.Function1.$anonfun$compose$1(Function1.scala:44)
[error] at sbt.internal.util.$tilde$greater.$anonfun$$u2219$1(TypeFunctions.scala:42)
[error] at sbt.std.Transform$$anon$4.work(System.scala:64)
[error] at sbt.Execute.$anonfun$submit$2(Execute.scala:257)
[error] at sbt.internal.util.ErrorHandling$.wideConvert(ErrorHandling.scala:16)
[error] at sbt.Execute.work(Execute.scala:266)
[error] at sbt.Execute.$anonfun$submit$1(Execute.scala:257)
[error] at sbt.ConcurrentRestrictions$$anon$4.$anonfun$submitValid$1(ConcurrentRestrictions.scala:167)
[error] at sbt.CompletionService$$anon$2.call(CompletionService.scala:32)
[error] at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[error] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
[error] at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[error] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[error] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[error] at java.lang.Thread.run(Thread.java:745)
[error] sbt.librarymanagement.ResolveException: unresolved dependency: ch.ethz.dalab#dissolvestruct_2.10;0.1-SNAPSHOT: not found
[error] at sbt.internal.librarymanagement.IvyActions$.resolveAndRetrieve(IvyActions.scala:331)
[error] at sbt.internal.librarymanagement.IvyActions$.$anonfun$updateEither$1(IvyActions.scala:205)
[error] at sbt.internal.librarymanagement.IvySbt$Module.$anonfun$withModule$1(Ivy.scala:229)
[error] at sbt.internal.librarymanagement.IvySbt.$anonfun$withIvy$1(Ivy.scala:190)
[error] at sbt.internal.librarymanagement.IvySbt.sbt$internal$librarymanagement$IvySbt$$action$1(Ivy.scala:70)
[error] at sbt.internal.librarymanagement.IvySbt$$anon$3.call(Ivy.scala:77)
[error] at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:93)
[error] at xsbt.boot.Locks$GlobalLock.xsbt$boot$Locks$GlobalLock$$withChannelRetries$1(Locks.scala:78)
[error] at xsbt.boot.Locks$GlobalLock$$anonfun$withFileLock$1.apply(Locks.scala:97)
[error] at xsbt.boot.Using$.withResource(Using.scala:10)
[error] at xsbt.boot.Using$.apply(Using.scala:9)
[error] at xsbt.boot.Locks$GlobalLock.ignoringDeadlockAvoided(Locks.scala:58)
[error] at xsbt.boot.Locks$GlobalLock.withLock(Locks.scala:48)
[error] at xsbt.boot.Locks$.apply0(Locks.scala:31)
[error] at xsbt.boot.Locks$.apply(Locks.scala:28)
[error] at sbt.internal.librarymanagement.IvySbt.withDefaultLogger(Ivy.scala:77)
[error] at sbt.internal.librarymanagement.IvySbt.withIvy(Ivy.scala:185)
[error] at sbt.internal.librarymanagement.IvySbt.withIvy(Ivy.scala:182)
[error] at sbt.internal.librarymanagement.IvySbt$Module.withModule(Ivy.scala:228)
[error] at sbt.internal.librarymanagement.IvyActions$.updateEither(IvyActions.scala:190)
[error] at sbt.librarymanagement.ivy.IvyDependencyResolution.update(IvyDependencyResolution.scala:20)
[error] at sbt.librarymanagement.DependencyResolution.update(DependencyResolution.scala:56)
[error] at sbt.internal.LibraryManagement$.resolve$1(LibraryManagement.scala:38)
[error] at sbt.internal.LibraryManagement$.$anonfun$cachedUpdate$12(LibraryManagement.scala:91)
[error] at sbt.util.Tracked$.$anonfun$lastOutput$1(Tracked.scala:68)
[error] at sbt.internal.LibraryManagement$.$anonfun$cachedUpdate$19(LibraryManagement.scala:104)
[error] at scala.util.control.Exception$Catch.apply(Exception.scala:224)
[error] at sbt.internal.LibraryManagement$.$anonfun$cachedUpdate$11(LibraryManagement.scala:104)
[error] at sbt.internal.LibraryManagement$.$anonfun$cachedUpdate$11$adapted(LibraryManagement.scala:87)
[error] at sbt.util.Tracked$.$anonfun$inputChanged$1(Tracked.scala:149)
[error] at sbt.internal.LibraryManagement$.cachedUpdate(LibraryManagement.scala:118)
[error] at sbt.Classpaths$.$anonfun$updateTask$5(Defaults.scala:2353)
[error] at scala.Function1.$anonfun$compose$1(Function1.scala:44)
[error] at sbt.internal.util.$tilde$greater.$anonfun$$u2219$1(TypeFunctions.scala:42)
[error] at sbt.std.Transform$$anon$4.work(System.scala:64)
[error] at sbt.Execute.$anonfun$submit$2(Execute.scala:257)
[error] at sbt.internal.util.ErrorHandling$.wideConvert(ErrorHandling.scala:16)
[error] at sbt.Execute.work(Execute.scala:266)
[error] at sbt.Execute.$anonfun$submit$1(Execute.scala:257)
[error] at sbt.ConcurrentRestrictions$$anon$4.$anonfun$submitValid$1(ConcurrentRestrictions.scala:167)
[error] at sbt.CompletionService$$anon$2.call(CompletionService.scala:32)
[error] at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[error] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
[error] at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[error] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[error] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[error] at java.lang.Thread.run(Thread.java:745)
[error] (:update) sbt.librarymanagement.ResolveException: unresolved dependency: ch.ethz.dalab#dissolvestruct_2.10;0.1-SNAPSHOT: not found
[error] (:ssExtractDependencies) sbt.librarymanagement.ResolveException: unresolved dependency: ch.ethz.dalab#dissolvestruct_2.10;0.1-SNAPSHOT: not found
[error] Total time: 19 s, completed Apr 7, 2018 9:57:39 PM
How can I solve this error?
Hello,
I am encountering an error when installing your requirements. You stated that you tested with SBT 0.31. According to the same link you provide, that version of SBT is no longer available (or it seems it never existed).
Therefore I just downloaded the most recent version from here.
Then I run the code as you suggest with the following command
sbt "runMain ch.ethz.dalab.web2text.ExtractPageFeatures index.html desired_output"
Sadly after some downloads, I encountered the following errors:
[error] web2text/cdom/Node.scala:39:33: missing parameter type
[error] val c = for (c <- children; l <- c.toString.lines) yield {" " + l}
^
[error] web2text/src/main/scala/ch/ethz/dalab/web2text/cleaneval/CleanEval.scala:140:54: value drop is not a member of java.util.stream.Stream[String]
[error] val contents = if (f.startsWith("URL:")) f.lines.drop(1).mkString("\n")
^
[error] web2text/src/main/scala/ch/ethz/dalab/web2text/features/PageFeatures.scala:29:63: type mismatch;
[error] found : java.util.stream.Stream[String]
[error] required: Iterator[?]
[error] (blockFeatureLabels.toIterator zip blockFeatures.toString.lines)
[error] three errors found
[error] (Compile / compileIncremental) Compilation failed
[error] Total time: 6 s, completed Dec 23, 2020, 5:07:31 PM
How else can I compile your code? Is there a Docker image from back then that can still be run today?
Thanks!
Is it possible to test your approach on newer datasets such as Dragnet? Cleaneval is really old and it doesn't reflect modern website designs.
This is a great Scala library for parsing HTML. Just wondering if it can be used for commercial purposes. Thanks!