
deepdive's People

Contributors

adamwgoldberg, ajratner, alexisbrenon, alldefector, amirabs, atlas7, bhancock8, bryanhe, byoo1, chrismre, clfarron4, daniter-cu, dennybritz, feiranwang, henryre, igozali, juhanaka, mikecafarella, msushkov, netj, raphaelhoffmann, rionda, sandipchatterjee, senwu, seojiwon, shahin, thomaspalomares, xiaoling, zhangce, zifeishan


deepdive's Issues

Allow extractors to output to multiple relations

All output of an extractor is currently written to the single relation specified in the output_relation setting. It would be useful to allow extractors to write to multiple relations. One way to implement this would be to allow a _relation key in the JSON output and group rows by its value, as sketched below.
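
A minimal sketch of what this could look like from the extractor's side, assuming a hypothetical _relation key (the key name, the input fields, and the grouping behavior are all proposals here, not an existing API):

#!/usr/bin/env python
# Hypothetical JSON extractor UDF that emits rows for two target relations.
# The "_relation" key is the proposed convention: the framework would group
# output rows by its value instead of using a single output_relation.
import json
import sys

for line in sys.stdin:
    row = json.loads(line)  # assumes one JSON object per input line
    print(json.dumps({"_relation": "words",
                      "title_id": row["id"],
                      "word": row["title"].split()[0]}))
    print(json.dumps({"_relation": "word_counts",
                      "title_id": row["id"],
                      "num_words": len(row["title"].split())}))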

Error with ocr example in develop branch

Error in develop branch but not in master.

14:05:38.050 [default-dispatcher-2][profiler][Profiler] DEBUG starting report_id=inference_grounding
14:05:38.051 [default-dispatcher-3][PostgresInferenceDataStoreComponent$PostgresInferenceDataStore(akka://deepdive)][PostgresInferenceDataStoreComponent$PostgresInferenceDataStore] INFO  Writing grounding queries to file="/var/folders/rz/0l6t9_w90hs_k6l6fq7nlsxm0000gn/T/grounding8297874664321351755.sql" 
14:05:38.052 [default-dispatcher-6][taskManager][TaskManager] INFO  Added task_id=inference
14:05:38.053 [default-dispatcher-6][taskManager][TaskManager] INFO  0/1 tasks eligible.
14:05:38.053 [default-dispatcher-6][taskManager][TaskManager] INFO  Tasks not_eligible: Set(inference)
14:05:38.054 [default-dispatcher-6][taskManager][TaskManager] INFO  Added task_id=calibration
14:05:38.054 [default-dispatcher-6][taskManager][TaskManager] INFO  0/2 tasks eligible.
14:05:38.055 [default-dispatcher-6][taskManager][TaskManager] INFO  Tasks not_eligible: Set(inference, calibration)
14:05:38.056 [default-dispatcher-6][taskManager][TaskManager] INFO  Added task_id=report
14:05:38.057 [default-dispatcher-6][taskManager][TaskManager] INFO  0/3 tasks eligible.
14:05:38.058 [default-dispatcher-6][taskManager][TaskManager] INFO  Tasks not_eligible: Set(inference, report, calibration)
14:05:38.058 [default-dispatcher-6][taskManager][TaskManager] INFO  Added task_id=shutdown
14:05:38.059 [default-dispatcher-6][taskManager][TaskManager] INFO  0/4 tasks eligible.
14:05:38.059 [default-dispatcher-6][taskManager][TaskManager] INFO  Tasks not_eligible: Set(shutdown, inference, report, calibration)
14:05:38.076 [default-dispatcher-3][PostgresInferenceDataStoreComponent$PostgresInferenceDataStore(akka://deepdive)][PostgresInferenceDataStoreComponent$PostgresInferenceDataStore] INFO  Executing grounding query...
14:05:38.351 [][][StatementExecutor$$anon$1] ERROR SQL execution failed (Reason: ERROR: invalid input syntax for integer: ""
  Position: 184):

   INSERT INTO dd_graph_weights(initial_value, is_fixed, description) SELECT DISTINCT 0.0 AS wValue, false AS wIsFixed, 'label1-' || (CASE WHEN "features.feature_id" IS NULL THEN '' ELSE "features.feature_id" END) || "label1_val_cardinality" AS wCmd FROM label1_query GROUP BY wValue, wIsFixed, wCmd

14:05:38.370 [default-dispatcher-3][inferenceManager][OneForOneStrategy] ERROR ERROR: invalid input syntax for integer: ""
  Position: 184
org.postgresql.util.PSQLException: ERROR: invalid input syntax for integer: ""
  Position: 184
    at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2157) ~[postgresql-9.2-1003-jdbc4.jar:na]
    at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1886) ~[postgresql-9.2-1003-jdbc4.jar:na]
    at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:255) ~[postgresql-9.2-1003-jdbc4.jar:na]
    at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:555) ~[postgresql-9.2-1003-jdbc4.jar:na]
    at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:417) ~[postgresql-9.2-1003-jdbc4.jar:na]
    at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:410) ~[postgresql-9.2-1003-jdbc4.jar:na]
    at org.apache.commons.dbcp.DelegatingPreparedStatement.execute(DelegatingPreparedStatement.java:172) ~[commons-dbcp-1.4.jar:1.4]
    at org.apache.commons.dbcp.DelegatingPreparedStatement.execute(DelegatingPreparedStatement.java:172) ~[commons-dbcp-1.4.jar:1.4]
    at scalikejdbc.StatementExecutor$$anonfun$execute$1.apply$mcZ$sp(StatementExecutor.scala:295) ~[scalikejdbc_2.10-1.7.4.jar:1.7.4]
    at scalikejdbc.StatementExecutor$$anonfun$execute$1.apply(StatementExecutor.scala:295) ~[scalikejdbc_2.10-1.7.4.jar:1.7.4]
    at scalikejdbc.StatementExecutor$$anonfun$execute$1.apply(StatementExecutor.scala:295) ~[scalikejdbc_2.10-1.7.4.jar:1.7.4]
    at scalikejdbc.StatementExecutor$NakedExecutor.apply(StatementExecutor.scala:33) ~[scalikejdbc_2.10-1.7.4.jar:1.7.4]
    at scalikejdbc.StatementExecutor$$anon$1.scalikejdbc$StatementExecutor$LoggingSQLAndTiming$$super$apply(StatementExecutor.scala:291) ~[scalikejdbc_2.10-1.7.4.jar:1.7.4]
    at scalikejdbc.StatementExecutor$LoggingSQLAndTiming$class.apply(StatementExecutor.scala:238) ~[scalikejdbc_2.10-1.7.4.jar:1.7.4]
    at scalikejdbc.StatementExecutor$$anon$1.scalikejdbc$StatementExecutor$LoggingSQLIfFailed$$super$apply(StatementExecutor.scala:291) ~[scalikejdbc_2.10-1.7.4.jar:1.7.4]
    at scalikejdbc.StatementExecutor$LoggingSQLIfFailed$class.apply(StatementExecutor.scala:269) ~[scalikejdbc_2.10-1.7.4.jar:1.7.4]
    at scalikejdbc.StatementExecutor$$anon$1.apply(StatementExecutor.scala:291) ~[scalikejdbc_2.10-1.7.4.jar:1.7.4]
    at scalikejdbc.StatementExecutor.execute(StatementExecutor.scala:295) ~[scalikejdbc_2.10-1.7.4.jar:1.7.4]
    at scalikejdbc.DBSession$$anonfun$executeWithFilters$1.apply(DBSession.scala:248) ~[scalikejdbc_2.10-1.7.4.jar:1.7.4]
    at scalikejdbc.DBSession$$anonfun$executeWithFilters$1.apply(DBSession.scala:246) ~[scalikejdbc_2.10-1.7.4.jar:1.7.4]
    at scalikejdbc.LoanPattern$.using(LoanPattern.scala:29) ~[scalikejdbc_2.10-1.7.4.jar:1.7.4]
    at scalikejdbc.package$.using(package.scala:76) ~[scalikejdbc_2.10-1.7.4.jar:1.7.4]
    at scalikejdbc.DBSession$class.executeWithFilters(DBSession.scala:245) ~[scalikejdbc_2.10-1.7.4.jar:1.7.4]
    at scalikejdbc.ActiveSession.executeWithFilters(DBSession.scala:420) ~[scalikejdbc_2.10-1.7.4.jar:1.7.4]
    at scalikejdbc.SQLExecution.apply(SQL.scala:441) ~[scalikejdbc_2.10-1.7.4.jar:1.7.4]
    at org.deepdive.inference.SQLInferenceDataStore$$anonfun$4$$anonfun$apply$4.apply(SQLInferenceDataStore.scala:39) ~[classes/:na]
    at org.deepdive.inference.SQLInferenceDataStore$$anonfun$4$$anonfun$apply$4.apply(SQLInferenceDataStore.scala:38) ~[classes/:na]
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) ~[scala-library.jar:0.13.1]
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) ~[scala-library.jar:0.13.1]
    at org.deepdive.inference.SQLInferenceDataStore$$anonfun$4.apply(SQLInferenceDataStore.scala:38) ~[classes/:na]
    at org.deepdive.inference.SQLInferenceDataStore$$anonfun$4.apply(SQLInferenceDataStore.scala:37) ~[classes/:na]
    at scalikejdbc.DBConnection$$anonfun$autoCommit$1.apply(DB.scala:185) ~[scalikejdbc_2.10-1.7.4.jar:1.7.4]
    at scalikejdbc.DBConnection$$anonfun$autoCommit$1.apply(DB.scala:184) ~[scalikejdbc_2.10-1.7.4.jar:1.7.4]
    at scalikejdbc.LoanPattern$.using(LoanPattern.scala:29) ~[scalikejdbc_2.10-1.7.4.jar:1.7.4]
    at scalikejdbc.package$.using(package.scala:76) ~[scalikejdbc_2.10-1.7.4.jar:1.7.4]
    at scalikejdbc.DBConnection$class.autoCommit(DB.scala:184) ~[scalikejdbc_2.10-1.7.4.jar:1.7.4]
    at scalikejdbc.DB.autoCommit(DB.scala:498) ~[scalikejdbc_2.10-1.7.4.jar:1.7.4]
    at scalikejdbc.DB$$anonfun$autoCommit$2.apply(DB.scala:641) ~[scalikejdbc_2.10-1.7.4.jar:1.7.4]
    at scalikejdbc.DB$$anonfun$autoCommit$2.apply(DB.scala:640) ~[scalikejdbc_2.10-1.7.4.jar:1.7.4]
    at scalikejdbc.LoanPattern$.using(LoanPattern.scala:29) ~[scalikejdbc_2.10-1.7.4.jar:1.7.4]
    at scalikejdbc.package$.using(package.scala:76) ~[scalikejdbc_2.10-1.7.4.jar:1.7.4]
    at scalikejdbc.DB$.autoCommit(DB.scala:640) ~[scalikejdbc_2.10-1.7.4.jar:1.7.4]
    at org.deepdive.inference.SQLInferenceDataStore$class.execute(SQLInferenceDataStore.scala:37) ~[classes/:na]
    at org.deepdive.inference.PostgresInferenceDataStoreComponent$PostgresInferenceDataStore.execute(PostgresInferenceDataStore.scala:19) ~[classes/:na]
    at org.deepdive.inference.SQLInferenceDataStore$class.groundFactorGraph(SQLInferenceDataStore.scala:536) ~[classes/:na]
    at org.deepdive.inference.PostgresInferenceDataStoreComponent$PostgresInferenceDataStore.groundFactorGraph(PostgresInferenceDataStore.scala:19) ~[classes/:na]
    at org.deepdive.inference.InferenceManager$$anonfun$receive$1.applyOrElse(InferenceManager.scala:59) ~[classes/:na]
    at akka.actor.Actor$class.aroundReceive(Actor.scala:467) ~[akka-actor_2.10-2.3-M2.jar:2.3-M2]
    at org.deepdive.inference.InferenceManager$PostgresInferenceManager.aroundReceive(InferenceManager.scala:116) ~[classes/:na]
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:491) [akka-actor_2.10-2.3-M2.jar:2.3-M2]
    at akka.actor.ActorCell.invoke(ActorCell.scala:462) [akka-actor_2.10-2.3-M2.jar:2.3-M2]
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) [akka-actor_2.10-2.3-M2.jar:2.3-M2]
    at akka.dispatch.Mailbox.run(Mailbox.scala:219) [akka-actor_2.10-2.3-M2.jar:2.3-M2]
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:385) [akka-actor_2.10-2.3-M2.jar:2.3-M2]
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [scala-library.jar:na]
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [scala-library.jar:na]
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [scala-library.jar:na]
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [scala-library.jar:na]
14:05:38.372 [default-dispatcher-6][inferenceManager][InferenceManager$PostgresInferenceManager] INFO  Starting
14:05:38.372 [default-dispatcher-3][factorGraphBuilder][FactorGraphBuilder$PostgresFactorGraphBuilder] INFO  Starting

udf_extractor -> json_extractor

udf_extractor (the default style) is a poor name, since the plpy and tsv extractors also use a "udf" setting.

We will soon rename it to json_extractor.

Nested Example

Change

deepdive.extractions: {
  wordsExtractor.style: "udf_extractor"
  wordsExtractor.output_relation: "words"
  wordsExtractor.input: "SELECT * FROM titles"
  wordsExtractor.udf: "words.py"
}

to

deepdive.extraction.extractors: {
  wordsExtractor {
    style: "udf_extractor"
    output_relation: "words"
    input: "SELECT * FROM titles"
    udf: "words.py"
  }
}

This makes two changes: (1) extractions becomes extraction.extractors; (2) the nested form is easier to read.

This should be updated in all documentation on the web.

Model examples

  • MRF
  • Linear CRF
  • Skip-chain CRF
  • Bayes Net
  • LDA
  • Min-Cut
  • Correlation Clustering

Code review: new extractors

  • Code review of the following components:
    • New extractor path 1: plpy_extractor
    • New extractor path 2: tsv_extractor
    • Extractor path 3: sql_extractor
    • Extractor path 4: cmd_extractor

If multiple variables are in the same table, tables must be renamed in the variable schema to prevent ID conflicts

I tried to update the smoke example for the develop branch. I changed the syntax to the current settings, but the grounding SQL script failed here:

INSERT INTO dd_graph_variables(id, data_type, initial_value, is_evidence, cardinality)
  SELECT people.id, 'Boolean', people.smokes::int, (people.smokes IS NOT NULL), null
  FROM people;

DROP TABLE IF EXISTS people_smokes_cardinality CASCADE;
CREATE TABLE people_smokes_cardinality(people_smokes_cardinality) AS VALUES (1) WITH DATA;

INSERT INTO dd_graph_variables(id, data_type, initial_value, is_evidence, cardinality)
  SELECT people.id, 'Boolean', people.has_cancer::int, (people.has_cancer IS NOT NULL), null
  FROM people;

DROP TABLE IF EXISTS people_has_cancer_cardinality CASCADE;
CREATE TABLE people_has_cancer_cardinality(people_has_cancer_cardinality) AS VALUES (1) WITH DATA;

INSERT INTO dd_graph_variables_map(variable_id)
  SELECT id FROM dd_graph_variables;

INSERT INTO dd_graph_variables_holdout(variable_id)
  SELECT id FROM dd_graph_variables
  WHERE RANDOM() < 0.0 AND is_evidence = true;

UPDATE dd_graph_variables SET is_evidence=false
  WHERE dd_graph_variables.id IN (SELECT variable_id FROM dd_graph_variables_holdout);

The error is:

21:53:48.558 [default-dispatcher-2][PostgresInferenceDataStoreComponent$PostgresInferenceDataStore(akka://deepdive)][PostgresInferenceDataStoreComponent$PostgresInferenceDataStore] INFO  Executing grounding query...
21:53:57.533 [][][StatementExecutor$$anon$1] ERROR SQL execution failed (Reason: ERROR: duplicate key violates unique constraint "dd_graph_variables_pkey"  (seg20 rulk.stanford.edu:40000 pid=25436)):

   DROP TABLE IF EXISTS people_smokes_cardinality CASCADE;CREATE TABLE people_smokes_cardinality(people_smokes_cardinality) AS VALUES (1) WITH DATA;INSERT INTO dd_graph_variables(id, data_type, initial_value, is_evidence, cardinality) SELECT people.id, 'Boolean', people.has_cancer::int, (people.has_cancer IS NOT NULL), null FROM people

21:53:57.558 [default-dispatcher-2][PostgresInferenceDataStoreComponent$PostgresInferenceDataStore(akka://deepdive)][PostgresInferenceDataStoreComponent$PostgresInferenceDataStore] ERROR org.postgresql.util.PSQLException: ERROR: duplicate key violates unique constraint "dd_graph_variables_pkey"  (seg20 rulk.stanford.edu:40000 pid=25436)
21:53:57.559 [default-dispatcher-2][PostgresInferenceDataStoreComponent$PostgresInferenceDataStore(akka://deepdive)][PostgresInferenceDataStoreComponent$PostgresInferenceDataStore] INFO  [Error] Please check the SQL cmd!
21:53:57.644 [default-dispatcher-5][inferenceManager][OneForOneStrategy] ERROR ERROR: duplicate key violates unique constraint "dd_graph_variables_pkey"  (seg20 rulk.stanford.edu:40000 pid=25436)
org.postgresql.util.PSQLException: ERROR: duplicate key violates unique constraint "dd_graph_variables_pkey"  (seg20 rulk.stanford.edu:40000 pid=25436)
    at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2157) ~[postgresql-9.2-1003-jdbc4.jar:na]

What seems to cause the error is that the variables "smokes" and "has_cancer" live in the same table, and the system uses the row ID as the variable ID; the second INSERT then fails because variables cannot have duplicate IDs...

Any suggestions?
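
One possible workaround, sketched below for vanilla PostgreSQL: give each variable column its own disjoint ID range during grounding by spacing out the row ID. The id*2 / id*2+1 scheme is purely illustrative and not the project's actual fix; psycopg2 is used only to make the sketch self-contained.

# Sketch: ground two variables that live in the same table ("people" from
# the smoke example) with disjoint variable IDs, by spacing the row ID.
import psycopg2

conn = psycopg2.connect(dbname="deepdive_smoke")
cur = conn.cursor()
for offset, column in [(0, "smokes"), (1, "has_cancer")]:
    # Identifiers cannot be bound as parameters, so the column name is
    # interpolated; the numeric offset is passed as a bound parameter.
    cur.execute("""
        INSERT INTO dd_graph_variables
               (id, data_type, initial_value, is_evidence, cardinality)
        SELECT people.id * 2 + %s, 'Boolean', people.{c}::int,
               (people.{c} IS NOT NULL), null
        FROM people;""".format(c=column), (offset,))
conn.commit()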

Do not run inference in the pipeline if there are no rules

Without any inference rules, the system should not run inference at all (extraction only). Currently it fails with an error like:

22:26:52 [inferenceManager] ERROR /afs/cs.stanford.edu/u/zifei/repos/deepdive/out/2014-04-21T222429/graph.weights (No such file or directory)

Extract-only Pipelines

Users should be able to perform only extractions, skipping grounding, learning, inference, and calibration. Alternatively, a more flexible pipeline mechanism should be supported.

Pure SQL extractors

People can currently write pure SQL extractors by using an empty extractor, but that is a hack. We should have a principled way to support pure SQL extractors. The difficulty is the assignment of unique variable IDs.

Warn / report error on bad configuration

There should be error messages when encountering unexpected configuration items.

Just now I mistyped "dependencies" as "depencencies", and there was no error message when parsing the config file. The dependencies were silently broken, and the programmer has no idea what caused the problem.

I strongly suggest that unexpected configuration keys cause the run to abort, or at least produce a warning.
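
A minimal sketch of the kind of check that could catch this, assuming the parsed extractor settings are available as a plain dict (the set of known keys below is illustrative, not DeepDive's actual schema):

# Sketch: reject unrecognized extractor settings, with a typo suggestion.
import difflib
import sys

KNOWN_KEYS = ["style", "output_relation", "input", "udf",
              "dependencies", "before", "after", "parallelism"]

def check_extractor_config(name, settings):
    for key in settings:
        if key not in KNOWN_KEYS:
            hint = difflib.get_close_matches(key, KNOWN_KEYS, n=1)
            msg = "unknown setting '%s' in extractor '%s'" % (key, name)
            if hint:
                msg += " (did you mean '%s'?)" % hint[0]
            sys.exit("[config error] " + msg)

# The typo from this issue would now fail fast with a suggestion:
check_extractor_config("myExtractor", {"style": "tsv_extractor",
                                       "depencencies": ["ext_people"]})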

Error thrown during sampling. java.lang.UnsupportedOperationException: empty.reduceLeft

I am getting the following error, although the setup is nearly identical to the deepdive_spouse example, and all the dd_* factor tables in PostgreSQL are populated (counts included below).

The application.conf file is available at https://github.com/tomMulholland/isDB

17:46:26.559 [Thread-23][sampler][Sampler] INFO  17:46:26.559 [main] DEBUG org.dennybritz.sampler.Runner$ - Creating factor graph...
17:46:26.640 [Thread-23][sampler][Sampler] INFO  17:46:26.639 [main] DEBUG org.dennybritz.sampler.Runner$ - Starting learning phase...
17:46:27.586 [Thread-23][sampler][Sampler] INFO  17:46:27.585 [main] DEBUG org.dennybritz.sampler.Learner - num_iterations=120
17:46:27.587 [Thread-23][sampler][Sampler] INFO  17:46:27.585 [main] DEBUG org.dennybritz.sampler.Learner - num_samples_per_iteration=1
17:46:27.587 [Thread-23][sampler][Sampler] INFO  17:46:27.586 [main] DEBUG org.dennybritz.sampler.Learner - learning_rate=0.1
17:46:27.588 [Thread-23][sampler][Sampler] INFO  17:46:27.587 [main] DEBUG org.dennybritz.sampler.Learner - diminish_rate=0.95
17:46:27.588 [Thread-23][sampler][Sampler] INFO  17:46:27.587 [main] DEBUG org.dennybritz.sampler.Learner - regularization_constant=0.01
17:46:27.589 [Thread-23][sampler][Sampler] INFO  17:46:27.587 [main] DEBUG org.dennybritz.sampler.Learner - num_factors=267260 num_query_factors=75456
17:46:27.590 [Thread-23][sampler][Sampler] INFO  17:46:27.587 [main] DEBUG org.dennybritz.sampler.Learner - num_weights=143009 num_query_weights=49011
17:46:27.590 [Thread-23][sampler][Sampler] INFO  17:46:27.587 [main] DEBUG org.dennybritz.sampler.Learner - num_query_variables=1791 num_evidence_variables=1227
17:46:27.751 [Thread-23][sampler][Sampler] INFO  17:46:27.750 [main] DEBUG org.dennybritz.sampler.Learner - iteration=0 learning_rate=0.1
Exception in thread "main" scala.collection.parallel.CompositeThrowable: Multiple exceptions thrown during a parallel computation: java.lang.UnsupportedOperationException: empty.reduceLeft
scala.collection.LinearSeqOptimized$class.reduceLeft(LinearSeqOptimized.scala:124)
scala.collection.immutable.List.reduceLeft(List.scala:84)
scala.collection.TraversableOnce$class.reduce(TraversableOnce.scala:195)
scala.collection.AbstractTraversable.reduce(Traversable.scala:105)
org.dennybritz.sampler.SamplingUtils$.sampleVariable(SamplingUtils.scala:34)
org.dennybritz.sampler.SamplingUtils$$anonfun$sampleVariables$1.apply$mcVI$sp(SamplingUtils.scala:42)
org.dennybritz.sampler.SamplingUtils$$anonfun$sampleVariables$1.apply(SamplingUtils.scala:42)
org.dennybritz.sampler.SamplingUtils$$anonfun$sampleVariables$1.apply(SamplingUtils.scala:42)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
scala.collection.parallel.immutable.ParHashSet$ParHashSetIterator.foreach(ParHashSet.scala:76)
.
.
.
    at scala.collection.parallel.package$$anon$1.alongWith(package.scala:85)
    at scala.collection.parallel.Task$class.mergeThrowables(Tasks.scala:86)
    at scala.collection.parallel.ParIterableLike$Foreach.mergeThrowables(ParIterableLike.scala:972)
    at scala.collection.parallel.Task$class.tryMerge(Tasks.scala:72)
    at scala.collection.parallel.ParIterableLike$Foreach.tryMerge(ParIterableLike.scala:972)
    at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.internal(Tasks.scala:190)
    at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.internal(Tasks.scala:514)
    at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:162)
    at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.compute(Tasks.scala:514)
    at scala.concurrent.forkjoin.RecursiveAction.exec(RecursiveAction.java:160)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
17:46:28.161 [default-dispatcher-11][inferenceManager][OneForOneStrategy] ERROR sampling failed (see error log for more details)
java.lang.RuntimeException: sampling failed (see error log for more details)
    at org.deepdive.inference.Sampler$$anonfun$receive$1.applyOrElse(Sampler.scala:36) ~[classes/:na]
    at akka.actor.Actor$class.aroundReceive(Actor.scala:467) ~[akka-actor_2.10-2.3-M2.jar:2.3-M2]
    at org.deepdive.inference.Sampler.aroundReceive(Sampler.scala:17) ~[classes/:na]
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:491) [akka-actor_2.10-2.3-M2.jar:2.3-M2]
    at akka.actor.ActorCell.invoke(ActorCell.scala:462) [akka-actor_2.10-2.3-M2.jar:2.3-M2]
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) [akka-actor_2.10-2.3-M2.jar:2.3-M2]
    at akka.dispatch.Mailbox.run(Mailbox.scala:219) [akka-actor_2.10-2.3-M2.jar:2.3-M2]
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:385) [akka-actor_2.10-2.3-M2.jar:2.3-M2]
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [scala-library.jar:na]
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [scala-library.jar:na]
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [scala-library.jar:na]
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [scala-library.jar:na]
17:46:28.164 [default-dispatcher-11][sampler][LocalActorRef] INFO  Message [akka.actor.PoisonPill$] from Actor[akka://deepdive/user/inferenceManager#-1596663203] to Actor[akka://deepdive/user/inferenceManager/sampler#-1865594242] was not delivered. [1] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
17:46:28.165 [default-dispatcher-4][inferenceManager][InferenceManager$PostgresInferenceManager] INFO  Starting
17:46:28.166 [default-dispatcher-11][factorGraphBuilder][FactorGraphBuilder$PostgresFactorGraphBuilder] INFO  Starting
17:46:56.074 [default-dispatcher-4][taskManager][TaskManager] INFO  Memory usage: 213/962MB (max: 962MB)

The DeepDive variable tables are populated:

isDB=# SELECT schemaname,relname,n_live_tup 
isDB-#   FROM pg_stat_user_tables 
isDB-#   ORDER BY n_live_tup DESC;
 schemaname |                 relname                 | n_live_tup 
------------+-----------------------------------------+------------
 public     | schol_features                          |     267260
 public     | f_is_schol_features_query               |     267260
 public     | selectedgesfordumpsql_raw               |     267260
 public     | dd_graph_edges                          |     267260
 public     | selectfactorsfordumpsql_raw             |     267260
 public     | dd_graph_factors                        |     267260
 public     | dd_graph_weights                        |     143009
 public     | selectweightsfordumpsql_raw             |     143009
 public     | selectvariablesfordumpsql_raw           |       3018
 public     | dd_graph_variables                      |       3018
 public     | scholarships                            |       3018
 public     | dd_graph_variables_map                  |       3018
 public     | websites                                |        852
 public     | schol_int_study                         |        645
 public     | financial_aid                           |        489
 public     | dd_graph_variables_holdout              |        269
 public     | factornum                               |          2
 public     | scholarships_is_scholarship_cardinality |          1
(18 rows)

 nfactor 
---------
       0
  267260
(2 rows)

Configuration sanity check

We should do a sanity check on the configuration upon loading. Instead of the application crashing in the middle of execution due to a configuration issue, we should immediately exit if we find an obvious mistake. Some things we can check for:

  • Does the schema match the variables used in factor functions?
  • Do all extractor UDF files exist?
  • Do all extractor output relations exist?

There probably are more things that we can check for.
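
A rough sketch of such a pre-flight check; every name here (the dict shapes, existing_tables, and so on) is a hypothetical stand-in for however the parsed configuration is represented internally:

# Sketch: fail fast on obvious configuration mistakes before running anything.
import os
import sys

def sanity_check(extractors, schema_variables, factor_functions, existing_tables):
    errors = []
    for fn in factor_functions:          # 1. factor variables are in the schema
        for var in fn["variables"]:
            if var not in schema_variables:
                errors.append("factor %s: unknown variable %s" % (fn["name"], var))
    for ext in extractors:
        udf = ext.get("udf")             # 2. extractor UDF files exist
        if udf and not os.path.exists(udf):
            errors.append("extractor %s: udf %s not found" % (ext["name"], udf))
        rel = ext.get("output_relation") # 3. output relations exist
        if rel and rel not in existing_tables:
            errors.append("extractor %s: relation %s missing" % (ext["name"], rel))
    if errors:
        sys.exit("configuration errors:\n  " + "\n  ".join(errors))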

Function executeSql Bug (in new sql_extractor)

There are potential errors in the new function executeSql in src/main/scala/org/deepdive/extraction/ExtractorRunner.scala.

Is the last commit well-tested? @senwu

I do sql:"select * from articles limit 10;" in an extractor and style:"sql_extractor", and the error goes like below:

22:04:29 [PostgresExtractionDataStore(akka://deepdive)] ERROR org.postgresql.util.PSQLException: A result was returned when none was expected.
22:04:29 [PostgresExtractionDataStore(akka://deepdive)] INFO  [Error] Please check the SQL cmd!
22:04:29 [extractorRunner-ext_test_sql] ERROR A result was returned when none was expected.
org.postgresql.util.PSQLException: A result was returned when none was expected.

When I build my own code on top of this function, I also get errors like:

21:43:47 [PostgresExtractionDataStore(akka://deepdive)] ERROR org.postgresql.util.PSQLException: No value specified for parameter 1.

@dennybritz : What is the right way to execute a sql query somewhere other than grounding?
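
For what it's worth, the failure mode is generic: pushing a SELECT through an update-style execute call fails precisely because a result set comes back where none is expected (in JDBC, executeUpdate rejects statements that return rows, while execute() accepts both). A sketch of the distinction in Python, purely as an analogy; the code in question here is Scala/JDBC:

# Sketch of the generic distinction (Python DB-API as an analogy only):
# cursor.description is None iff the statement produced no result set,
# so one code path can run both queries and side-effecting statements.
import psycopg2

conn = psycopg2.connect(dbname="deepdive")
cur = conn.cursor()
cur.execute("select * from articles limit 10;")
if cur.description is not None:
    rows = cur.fetchall()   # the statement was a query: consume its rows
    print(len(rows))
else:
    conn.commit()           # DML/DDL: nothing to fetch, just commit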

holdout_query must end with ";"

I found that, for now, holdout_query must end with ";". If I use the following:

holdout_query: "INSERT INTO dd_graph_variables_holdout(variable_id) select id from candidate where docid in (select docid from eval_docs)"

The generated SQL is:

21:59:06 [] ERROR SQL execution failed (Reason: ERROR: syntax error at or near "UPDATE"
  Position: 122):

   DROP TABLE IF EXISTS candidate_label_cardinality CASCADE;CREATE TABLE candidate_label_cardinality(candidate_label_cardinality) AS VALUES (1) WITH DATA;INSERT INTO dd_graph_variables_map(variable_id) SELECT id FROM dd_graph_variables;INSERT INTO dd_graph_variables_holdout(variable_id) select id from candidate where docid in (select docid from eval_docs)UPDATE dd_graph_variables SET is_evidence=false WHERE dd_graph_variables.id IN (SELECT variable_id FROM dd_graph_variables_holdout)

21:59:07 [inferenceManager] ERROR ERROR: syntax error at or near "UPDATE"
  Position: 122

Apparently the system fails to append a ";" after holdout_query when concatenating the generated statements. This should be easy to fix; a sketch follows.
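
A minimal sketch of a safer way to assemble the generated script, normalizing trailing semicolons on every fragment (illustrative only, not the actual DeepDive code):

# Sketch: join generated SQL fragments so a missing trailing ";" in a
# user-supplied statement (such as holdout_query) cannot fuse two statements.
def assemble_script(statements):
    return ";\n".join(s.strip().rstrip(";") for s in statements) + ";"

script = assemble_script([
    "INSERT INTO dd_graph_variables_holdout(variable_id) "
    "SELECT id FROM candidate WHERE docid IN (SELECT docid FROM eval_docs)",
    "UPDATE dd_graph_variables SET is_evidence=false "
    "WHERE dd_graph_variables.id IN "
    "(SELECT variable_id FROM dd_graph_variables_holdout)",
])
print(script)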

Variable ID Issue

  1. The sampler requires a globally unique ID column across all tables containing variables.
  2. The new grounding no longer reorders variable IDs; it only uses the ID columns of the different tables, so IDs must be globally unique across tables before grounding.
  3. We could reassign globally unique IDs before grounding, but IDs are currently visible to users, so reassignment may break other references.
  4. The default extractor assumes the output relation has an ID and automatically appends an ID column to the output JSON (see def buildCopySql in PostgresExtractionDataStore.scala).
  5. "bigserial" is slow in large-scale applications.

We must fix this before the next code push.
@zhangce @feiranwang @msushkov
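
If reassignment turns out to be acceptable despite point 3, one cheap alternative to bigserial is bulk renumbering with ROW_NUMBER() plus a running per-table offset. A sketch for vanilla PostgreSQL (the table names are examples; this is not the project's chosen fix):

# Sketch: make IDs globally unique across all variable tables by renumbering
# each table in bulk, shifting by the running total of rows seen so far.
import psycopg2

conn = psycopg2.connect(dbname="deepdive")
cur = conn.cursor()
offset = 0
for table in ["people", "candidate"]:  # example tables holding variables
    cur.execute("""
        UPDATE {t} SET id = renumbered.new_id + %s
        FROM (SELECT ctid AS row_tid, ROW_NUMBER() OVER () - 1 AS new_id
              FROM {t}) AS renumbered
        WHERE {t}.ctid = renumbered.row_tid;""".format(t=table), (offset,))
    offset += cur.rowcount  # rows updated == rows in this table
conn.commit()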

Spouse example: tsv_extractor row count mismatch

The tsv_extractor may have dropped some rows due to unknown TSV parsing issues; one guess at the cause is sketched after the counts below.

deepdive_spouse_tsv=# select count(*) from has_spouse_features ;
 count
--------
 151808
(1 row)

deepdive_spouse_tsv=# select count(*) from has_spouse;
 count
-------
 75446
(1 row)

deepdive_spouse_tsv=# select count(*) from people_mentions ;
 count
-------
 39269
(1 row)

The correct numbers, from the plpy extractor, should be:

deepdive_spouse_plpy=# select count(*) from has_spouse_features;
 count
--------
 151824
(1 row)

deepdive_spouse_plpy=# select count(*) from has_spouse;
 count
-------
 75454
(1 row)

deepdive_spouse_plpy=# select count(*) from people_mentions ;
 count
-------
 39270
(1 row)

(the numbers were cross-checked with the other two extractor styles)
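
One plausible culprit (a guess, not a confirmed diagnosis): text fields containing tabs, newlines, or backslashes that are not escaped symmetrically when rows are serialized to TSV and parsed back. A sketch of symmetric escaping, loosely following PostgreSQL's COPY text-format conventions:

# Sketch: escape/unescape TSV fields so embedded tabs or newlines in text
# (e.g. sentence strings) cannot shift columns or split rows.
def escape_tsv(value):
    return (value.replace("\\", "\\\\")
                 .replace("\t", "\\t")
                 .replace("\n", "\\n"))

def unescape_tsv(field):
    # Scan left to right; naive chained str.replace would mis-handle "\\n".
    out, i = [], 0
    while i < len(field):
        if field[i] == "\\" and i + 1 < len(field):
            out.append({"t": "\t", "n": "\n"}.get(field[i + 1], field[i + 1]))
            i += 2
        else:
            out.append(field[i])
            i += 1
    return "".join(out)

assert unescape_tsv(escape_tsv("a\tb\\nc\nd")) == "a\tb\\nc\nd"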
