linkedin / avro2tf

Avro2TF is designed to fill the gap of making users' training data ready to be consumed by deep learning training frameworks.

License: BSD 2-Clause "Simplified" License

Scala 100.00%

deep-learning linkedin machine-learning tensorflow

avro2tf's Issues

[Release Blog] Avro2TF TF-Ranking Support

Add Support to Fill Default Values for Null Features with Any Data Type

Add Support to Convert Game NTV Data into Spark MLlib Sparse Vector

Add more user-friendly exception messages that tell users what went wrong

[Documentation] Avro2TF TF-Ranking Support

Add Avro2TF Native Support on Multi-dimensional Features

java.util.NoSuchElementException: No value found for 'sparseVector'

The README describes setting the dtype to 'sparseVector', which reflects an older version of this library. Following it produces the error above: "No value found for 'sparseVector'".

The documentation should talk more about how the fields "isSparse": true | false , "dtype": "float", and "isDocumentFeature": true | false are used together to produce the indices and values output and how the naming works. This will allow people to know how to use the library from the documentation without having to go into the tutorial.

The README also uses 'words_wideFeatures_sparse' without explaining it.

GCP support via Dataproc

Opening this issue to add Google Cloud Dataproc support for Avro2TF.

Add an initialization action.
Create an example: read a BigQuery table (public datasets), export the table to Avro format, and use Avro2TF to generate TFRecords.
Train a model using the TFRecords.

demo data throws NoSuchElementException: No value found for 'sparseVector' when parsing json config file

I use the default JSON config file:

{
  "features": [
    {
      "inputFeatureInfo": {
        "columnExpr": "userId"
      },
      "outputTensorInfo": {
        "name": "userId",
        "dtype": "long",
        "shape": [
          -1
        ]
      }
    },
    {
      "inputFeatureInfo": {
        "columnExpr": "movieId",
        "transformConfig": {
          "hashInfo": {
            "hashBucketSize": 1000,
            "numHashFunctions": 4
          }
        }
      },
      "outputTensorInfo": {
        "name": "movieId_hashed",
        "dtype": "long",
        "shape": [
          4
        ]
      }
    },
    {
      "inputFeatureInfo": {
        "columnExpr": "genreFeatures.term"
      },
      "outputTensorInfo": {
        "name": "genreFeatures_term",
        "dtype": "long",
        "shape": [
          -1
        ]
      }
    },
    {
      "inputFeatureInfo": {
        "columnConfig": {
          "genreFeatures": {
            "whitelist": [
              "Genre"
            ]
          },
          "movieLatentFactorFeatures": {
            "blacklist": [
              "0"
            ]
          }
        },
        "transformConfig": {
          "hashInfo": {
            "hashBucketSize": 100,
            "combiner": "AVG"
          }
        }
      },
      "outputTensorInfo": {
        "name": "genreFeatures_movieLatentFactorFeatures",
        "dtype": "SparseVector",
        "shape": []
      }
    }
  ],
  "labels": [
    {
      "inputFeatureInfo": {
        "columnExpr": "response"
      },
      "outputTensorInfo": {
        "name": "response",
        "dtype": "double",
        "shape": []
      }
    }
  ]
}

It throws the following exception:

Error: Option --avro2tf-config-path failed when given 'tensorizeIn_config_movielens.json'. java.util.NoSuchElementException: No value found for 'sparseVector'
	at scala.Enumeration.withName(Enumeration.scala:124)
	at io.circe.Decoder$$anonfun$enumDecoder$1$$anonfun$apply$22$$anonfun$apply$23.apply(Decoder.scala:1097)
	at io.circe.Decoder$$anonfun$enumDecoder$1$$anonfun$apply$22$$anonfun$apply$23.apply(Decoder.scala:1097)
	at scala.util.Try$.apply(Try.scala:192)
	at io.circe.Decoder$$anonfun$enumDecoder$1$$anonfun$apply$22.apply(Decoder.scala:1097)
	at io.circe.Decoder$$anonfun$enumDecoder$1$$anonfun$apply$22.apply(Decoder.scala:1096)
	at io.circe.Decoder$$anon$37.apply(Decoder.scala:438)
	at io.circe.Decoder$class.tryDecode(Decoder.scala:46)
	at io.circe.Decoder$$anon$37.tryDecode(Decoder.scala:437)
	at io.circe.Decoder$$anon$22.tryDecode(Decoder.scala:94)
	at com.linkedin.avro2tf.parsers.Avro2TFConfigParser$$anonfun$1$anon$importedDecoder$macro$190$1$$anon$25.configuredDecode(Avro2TFConfigParser.scala:31)
	at io.circe.generic.extras.decoding.ConfiguredDecoder$CaseClassConfiguredDecoder.apply(ConfiguredDecoder.scala:58)
	at io.circe.Decoder$class.tryDecode(Decoder.scala:46)
	at io.circe.generic.decoding.DerivedDecoder.tryDecode(DerivedDecoder.scala:6)
	at com.linkedin.avro2tf.parsers.Avro2TFConfigParser$$anonfun$1$anon$importedDecoder$macro$190$1$$anon$30.configuredDecode(Avro2TFConfigParser.scala:31)
	at io.circe.generic.extras.decoding.ConfiguredDecoder$CaseClassConfiguredDecoder.apply(ConfiguredDecoder.scala:58)
	at io.circe.Decoder$class.decodeJson(Decoder.scala:64)
	at io.circe.generic.decoding.DerivedDecoder.decodeJson(DerivedDecoder.scala:6)
	at io.circe.Parser$class.finishDecode(Parser.scala:13)
	at io.circe.config.parser$.finishDecode(parser.scala:64)
	at io.circe.config.parser$.decode(parser.scala:164)
	at io.circe.config.syntax$CirceConfigOps$.as$extension0(syntax.scala:176)
	at com.linkedin.avro2tf.parsers.Avro2TFConfigParser$$anonfun$1.apply(Avro2TFConfigParser.scala:31)
	at com.linkedin.avro2tf.parsers.Avro2TFConfigParser$$anonfun$1.apply(Avro2TFConfigParser.scala:31)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
	at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.AbstractTraversable.map(Traversable.scala:104)
	at com.linkedin.avro2tf.parsers.Avro2TFConfigParser$.getAvro2TFConfiguration(Avro2TFConfigParser.scala:31)
	at com.linkedin.avro2tf.parsers.Avro2TFJobParamsParser$$anon$1$$anonfun$11.apply(Avro2TFJobParamsParser.scala:208)
	at com.linkedin.avro2tf.parsers.Avro2TFJobParamsParser$$anon$1$$anonfun$11.apply(Avro2TFJobParamsParser.scala:179)
	at scopt.OptionDef$$anonfun$34.apply(options.scala:600)
	at scopt.OptionDef.applyArgument(options.scala:679)
	at scopt.OptionParser.scopt$OptionParser$$handleArgument$1(options.scala:444)
	at scopt.OptionParser.parse(options.scala:490)
	at com.linkedin.avro2tf.parsers.Avro2TFJobParamsParser$.parse(Avro2TFJobParamsParser.scala:359)
	at com.tencent.weishi.recall.DataFrame2TFRecord$.main(DataFrame2TFRecord.scala:169)
	at com.tencent.weishi.recall.DataFrame2TFRecord.main(DataFrame2TFRecord.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$4.run(ApplicationMaster.scala:727)
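
The top frame of the trace is scala.Enumeration.withName, which throws NoSuchElementException whenever the given string does not match any value of the enumeration. The following is a minimal sketch of a lookup that returns a readable error instead. The `DtypeLookup` object, the `Dtype` enumeration, and its value names are hypothetical and are not taken from the Avro2TF source; the actual set of supported dtypes may differ.

```scala
object DtypeLookup {
  // Hypothetical enumeration for illustration; not the real Avro2TF type list.
  object Dtype extends Enumeration {
    val IntType = Value("int")
    val LongType = Value("long")
    val FloatType = Value("float")
    val DoubleType = Value("double")
    val StringType = Value("string")
  }

  // Look the name up without throwing. On a miss, report the offending
  // name together with the names that would have been accepted.
  def parse(name: String): Either[String, Dtype.Value] =
    Dtype.values
      .find(_.toString == name)
      .toRight(
        s"Unsupported dtype '$name'; expected one of: ${Dtype.values.mkString(", ")}"
      )
}
```

With this shape, `DtypeLookup.parse("sparseVector")` yields a `Left` carrying the message, whereas `DtypeLookup.Dtype.withName("sparseVector")` throws the exception seen above.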

Add an option to sort the feature list and use the sorted feature list as the mapping

In many cases, sorting the feature list before vectorization ensures the same feature-to-index mapping across different Avro2TF runs.
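
As a rough illustration of the idea (not the library's implementation), the sketch below builds an index mapping from a sorted, de-duplicated feature list, so the result does not depend on the order in which features were encountered. The name `SortedFeatureMapping` is hypothetical.

```scala
object SortedFeatureMapping {
  // Assign indices in lexicographic order of the distinct feature names.
  // Input order and duplicates do not affect the result.
  def build(features: Iterable[String]): Map[String, Int] =
    features.toSeq.distinct.sorted.zipWithIndex.toMap
}
```

Two runs that see the same set of features in different orders, for example `Seq("b", "a", "c")` and `Seq("c", "b", "a", "a")`, produce the same mapping.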

Add Support to Convert Tensor Data into Spark MLlib Sparse Vector

Add Support to Convert Spark MLlib Sparse Vector into Tensor Data

Add Support to Convert Game NTV Data into Spark MLlib Dense Vector

Add Support to Pass through Columns Without Any Transformation and Conversion

Add Feature Mapping Sharing and Feature Frequency Sorting Capacity to Avro2TF

Add Support to Map Not Only an Array of Strings to Ids but also a String to an Id

TensorizeIn does not respect --shuffle=true

In particular, when --num-output-files is unspecified, the output data is not repartitioned.

TensorizeIn: When --skip-conversion=true, the job output is not saved to HDFS

Support Game Data in NTV format with Spark SQL Column Expression

Notice: During feature indices conversion, we need to convert the feature value into a float first.
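
A minimal sketch of what that conversion step might look like, assuming an NTV record carries a name, a term, and a numeric value. The `Ntv` case class, the "name,term" key format, and the mapping argument are assumptions made for illustration and are not taken from the Avro2TF source.

```scala
object NtvConversion {
  // Hypothetical shape of a name-term-value record.
  final case class Ntv(name: String, term: String, value: Double)

  // Look up each feature's index and convert its value to Float first,
  // as the note above requires. Features missing from the mapping are dropped.
  def toIndexedFloats(
      features: Seq[Ntv],
      mapping: Map[String, Int]
  ): Seq[(Int, Float)] =
    features.flatMap { f =>
      // Assumed key format: "name,term".
      mapping.get(s"${f.name},${f.term}").map(index => (index, f.value.toFloat))
    }
}
```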

PrepRankingData: --group-list-max-size param documentation says "optional"

The parameter is required, but its documentation states that it is optional.

There is already a Spark DataFrame to TFRecord connector in tf/ecosystem, though.

https://github.com/tensorflow/ecosystem/tree/master/spark/spark-tensorflow-connector

Since you already use Spark for the transformation, what is the benefit of converting with Avro2TF instead of using a format-agnostic connector?

There is also an existing Avro dataset reader here: https://github.com/tensorflow/io/blob/master/tensorflow_io/avro/20190307-avro-dataset.md

[Tutorial] Avro2TF TF-Ranking Support

Add Support on Converting Data in Union Type

For example, within a single field of the original data, some records hold Integer values while others hold Double values.
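
One way to handle such a field, sketched here under the assumption that the union contains only numeric branches (and possibly null), is to widen every value to Double. The `UnionNormalizer` name is hypothetical, and this is not how Avro2TF itself resolves unions.

```scala
object UnionNormalizer {
  // Widen any numeric branch of the union to Double.
  // Null and non-numeric values yield None so the caller can decide
  // whether to drop the record or fill in a default.
  def toDouble(value: Any): Option[Double] = value match {
    case null      => None
    case i: Int    => Some(i.toDouble)
    case l: Long   => Some(l.toDouble)
    case f: Float  => Some(f.toDouble)
    case d: Double => Some(d)
    case _         => None
  }
}
```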

Add Support to Convert Tensor Data into Spark MLlib Dense Vector

Add Support for TF.Ranking with Training Data in TFRecord Format

Plus Jupyter notebook tutorial.

Add HOCON support for TensorizeIn config

It would be great if TensorizeIn config supported HOCON as that would allow comments.

Partition skew when TensorizeIn computes 'max' of its integer columns

There appears to be an intermittent performance issue, where the "head at TensorMetadataGeneration.scala:113" job can take almost an hour in a single executor, while all other executors take less than 1-2 minutes.

Perhaps agg(Map[String, String]) is less efficient (or less Catalyst-optimizable) than agg(Column, Column*).

Add Support for TF.Ranking with Training Data in Avro Format

Add Support to Convert Spark MLlib Dense Vector into Tensor Data

Add More User-friendly Checks on Invalid User Configuration Combinations

Restore support for "accept single value as array" in config?

In looking at the diff in #39, I noticed that DeserializationFeature.ACCEPT_SINGLE_VALUE_AS_ARRAY was set to true, but, after #31 was merged, this effectively became false. This behavior wasn't documented or tested, but could be restored if desired. (I assume it was intended to be used with the labels field.)
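
For readers unfamiliar with the option, "accept single value as array" means a field declared as a list also accepts one bare element, which is treated as a one-element list. The sketch below shows that behavior on a deliberately tiny, hypothetical JSON model; it uses neither Jackson nor the project's actual parser, and all names in it are illustrative.

```scala
object SingleValueAsArray {
  // Hypothetical minimal JSON model, for illustration only.
  sealed trait Json
  final case class JsonObject(fields: Map[String, String]) extends Json
  final case class JsonArray(items: List[Json]) extends Json

  // An array is returned as-is; any other value is wrapped
  // in a one-element list.
  def asList(json: Json): List[Json] = json match {
    case JsonArray(items) => items
    case single           => List(single)
  }
}
```

Under this rule, a config whose labels field holds a single object would be read the same way as one whose labels field holds an array containing that object.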