linkedin / avro2tf
Avro2TF is designed to fill the gap of making users' training data ready to be consumed by deep learning training frameworks.
License: BSD 2-Clause "Simplified" License
The README talks about setting the dtype to 'sparseVector', which comes from an older version of this library. With the current version you get the error "No value found for 'sparseVector'".
The documentation should explain in more detail how the fields "isSparse": true | false, "dtype": "float", and "isDocumentFeature": true | false work together to produce the indices and values output, and how the output naming works. That would let people use the library from the documentation alone, without having to work through the tutorial.
The README uses 'words_wideFeatures_sparse' but doesn't give any details about it.
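As a rough illustration of what an "indices and values" output means for a sparse feature, the sketch below splits a dense vector into that pair. It is a generic Python sketch, not Avro2TF's implementation: the function name `to_sparse` is hypothetical, and it says nothing about how Avro2TF names its output fields.

```python
from typing import List, Tuple


def to_sparse(dense: List[float]) -> Tuple[List[int], List[float]]:
    """Return (indices, values) for the non-zero entries of a dense vector."""
    # Positions of the non-zero entries.
    indices = [i for i, v in enumerate(dense) if v != 0.0]
    # The non-zero entries themselves, in the same order as `indices`.
    values = [dense[i] for i in indices]
    return indices, values


indices, values = to_sparse([0.0, 1.5, 0.0, 0.0, 2.0])
```

Documentation along these lines, using the library's real field names, would answer the question raised in this issue.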
Opening this issue to add support for Google Cloud Dataproc to Avro2TF.
I use the default JSON file:
{
  "features": [
    {
      "inputFeatureInfo": {
        "columnExpr": "userId"
      },
      "outputTensorInfo": {
        "name": "userId",
        "dtype": "long",
        "shape": [
          -1
        ]
      }
    },
    {
      "inputFeatureInfo": {
        "columnExpr": "movieId",
        "transformConfig": {
          "hashInfo": {
            "hashBucketSize": 1000,
            "numHashFunctions": 4
          }
        }
      },
      "outputTensorInfo": {
        "name": "movieId_hashed",
        "dtype": "long",
        "shape": [
          4
        ]
      }
    },
    {
      "inputFeatureInfo": {
        "columnExpr": "genreFeatures.term"
      },
      "outputTensorInfo": {
        "name": "genreFeatures_term",
        "dtype": "long",
        "shape": [
          -1
        ]
      }
    },
    {
      "inputFeatureInfo": {
        "columnConfig": {
          "genreFeatures": {
            "whitelist": [
              "Genre"
            ]
          },
          "movieLatentFactorFeatures": {
            "blacklist": [
              "0"
            ]
          }
        },
        "transformConfig": {
          "hashInfo": {
            "hashBucketSize": 100,
            "combiner": "AVG"
          }
        }
      },
      "outputTensorInfo": {
        "name": "genreFeatures_movieLatentFactorFeatures",
        "dtype": "SparseVector",
        "shape": []
      }
    }
  ],
  "labels": [
    {
      "inputFeatureInfo": {
        "columnExpr": "response"
      },
      "outputTensorInfo": {
        "name": "response",
        "dtype": "double",
        "shape": []
      }
    }
  ]
}
It throws the following exception:
Error: Option --avro2tf-config-path failed when given 'tensorizeIn_config_movielens.json'. java.util.NoSuchElementException: No value found for 'sparseVector'
at scala.Enumeration.withName(Enumeration.scala:124)
at io.circe.Decoder$$anonfun$enumDecoder$1$$anonfun$apply$22$$anonfun$apply$23.apply(Decoder.scala:1097)
at io.circe.Decoder$$anonfun$enumDecoder$1$$anonfun$apply$22$$anonfun$apply$23.apply(Decoder.scala:1097)
at scala.util.Try$.apply(Try.scala:192)
at io.circe.Decoder$$anonfun$enumDecoder$1$$anonfun$apply$22.apply(Decoder.scala:1097)
at io.circe.Decoder$$anonfun$enumDecoder$1$$anonfun$apply$22.apply(Decoder.scala:1096)
at io.circe.Decoder$$anon$37.apply(Decoder.scala:438)
at io.circe.Decoder$class.tryDecode(Decoder.scala:46)
at io.circe.Decoder$$anon$37.tryDecode(Decoder.scala:437)
at io.circe.Decoder$$anon$22.tryDecode(Decoder.scala:94)
at com.linkedin.avro2tf.parsers.Avro2TFConfigParser$$anonfun$1$anon$importedDecoder$macro$190$1$$anon$25.configuredDecode(Avro2TFConfigParser.scala:31)
at io.circe.generic.extras.decoding.ConfiguredDecoder$CaseClassConfiguredDecoder.apply(ConfiguredDecoder.scala:58)
at io.circe.Decoder$class.tryDecode(Decoder.scala:46)
at io.circe.generic.decoding.DerivedDecoder.tryDecode(DerivedDecoder.scala:6)
at com.linkedin.avro2tf.parsers.Avro2TFConfigParser$$anonfun$1$anon$importedDecoder$macro$190$1$$anon$30.configuredDecode(Avro2TFConfigParser.scala:31)
at io.circe.generic.extras.decoding.ConfiguredDecoder$CaseClassConfiguredDecoder.apply(ConfiguredDecoder.scala:58)
at io.circe.Decoder$class.decodeJson(Decoder.scala:64)
at io.circe.generic.decoding.DerivedDecoder.decodeJson(DerivedDecoder.scala:6)
at io.circe.Parser$class.finishDecode(Parser.scala:13)
at io.circe.config.parser$.finishDecode(parser.scala:64)
at io.circe.config.parser$.decode(parser.scala:164)
at io.circe.config.syntax$CirceConfigOps$.as$extension0(syntax.scala:176)
at com.linkedin.avro2tf.parsers.Avro2TFConfigParser$$anonfun$1.apply(Avro2TFConfigParser.scala:31)
at com.linkedin.avro2tf.parsers.Avro2TFConfigParser$$anonfun$1.apply(Avro2TFConfigParser.scala:31)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at com.linkedin.avro2tf.parsers.Avro2TFConfigParser$.getAvro2TFConfiguration(Avro2TFConfigParser.scala:31)
at com.linkedin.avro2tf.parsers.Avro2TFJobParamsParser$$anon$1$$anonfun$11.apply(Avro2TFJobParamsParser.scala:208)
at com.linkedin.avro2tf.parsers.Avro2TFJobParamsParser$$anon$1$$anonfun$11.apply(Avro2TFJobParamsParser.scala:179)
at scopt.OptionDef$$anonfun$34.apply(options.scala:600)
at scopt.OptionDef.applyArgument(options.scala:679)
at scopt.OptionParser.scopt$OptionParser$$handleArgument$1(options.scala:444)
at scopt.OptionParser.parse(options.scala:490)
at com.linkedin.avro2tf.parsers.Avro2TFJobParamsParser$.parse(Avro2TFJobParamsParser.scala:359)
at com.tencent.weishi.recall.DataFrame2TFRecord$.main(DataFrame2TFRecord.scala:169)
at com.tencent.weishi.recall.DataFrame2TFRecord.main(DataFrame2TFRecord.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$4.run(ApplicationMaster.scala:727)
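The trace shows the failure coming from `scala.Enumeration.withName`, that is, the dtype string could not be matched to a known enumeration value while the config was being decoded. A pre-flight check along the following lines could surface such entries before a job is submitted. This is only a sketch in Python: the function name `find_unknown_dtypes` is hypothetical, and the allowed set is an assumption built from the dtype names that appear in this thread, not the library's actual enumeration.

```python
import json
from typing import Dict, List, Set

# Assumption: only the dtype names mentioned in this thread. The real
# enumeration in the library may contain more (or differently spelled) values.
ASSUMED_DTYPES: Set[str] = {"long", "float", "double"}


def find_unknown_dtypes(config: Dict, allowed: Set[str] = ASSUMED_DTYPES) -> List[str]:
    """List every output tensor whose dtype is not in the allowed set."""
    problems = []
    for section in ("features", "labels"):
        for entry in config.get(section, []):
            info = entry.get("outputTensorInfo", {})
            dtype = info.get("dtype")
            if dtype not in allowed:
                problems.append(f"{section}/{info.get('name')}: dtype {dtype!r}")
    return problems


# A trimmed-down version of the config from this issue.
sample = json.loads("""
{
  "features": [
    {"inputFeatureInfo": {"columnExpr": "userId"},
     "outputTensorInfo": {"name": "userId", "dtype": "long", "shape": [-1]}},
    {"inputFeatureInfo": {},
     "outputTensorInfo": {"name": "genreFeatures_movieLatentFactorFeatures",
                          "dtype": "SparseVector", "shape": []}}
  ],
  "labels": [
    {"inputFeatureInfo": {"columnExpr": "response"},
     "outputTensorInfo": {"name": "response", "dtype": "double", "shape": []}}
  ]
}
""")

problems = find_unknown_dtypes(sample)
```

With the sample above, only the `SparseVector` entry is reported, which is the entry the exception points at.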
In many cases, having a sorted featurelist for the vectorization ensures the same mapping across different avro2tf runs.
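The point about a sorted feature list can be sketched as follows: if indices are assigned after sorting, two runs that see the same features in a different order still produce the same mapping. This is a minimal Python sketch, not Avro2TF code; `build_feature_index` and the feature names are hypothetical.

```python
from typing import Dict, Iterable


def build_feature_index(features: Iterable[str]) -> Dict[str, int]:
    """Assign each distinct feature name an index based on sorted order."""
    # De-duplicate, then sort, so the mapping does not depend on input order.
    return {name: idx for idx, name in enumerate(sorted(set(features)))}


# Two "runs" that encounter the same features in a different order.
run_a = build_feature_index(["Genre,Drama", "Genre,Action", "Genre,Comedy"])
run_b = build_feature_index(["Genre,Comedy", "Genre,Drama", "Genre,Action"])
```

Without the sort, the index of a feature would depend on the order in which records happen to be processed.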
In particular, when --num-output-files is unspecified, the output data is not repartitioned.
Notice: During feature indices conversion, we need to convert the feature value into a float first.
The parameter is required, but its documentation states that it is optional.
https://github.com/tensorflow/ecosystem/tree/master/spark/spark-tensorflow-connector
Since you already use Spark to do the transformation, what is the benefit of converting with Avro2TF instead of using a format-free connector?
There is also an existing Avro dataset reader here: https://github.com/tensorflow/io/blob/master/tensorflow_io/avro/20190307-avro-dataset.md
For example, in one field of the original data, some records are in Integer format, some records are in Double format.
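One way to handle a field whose records mix Integer and Double values is to coerce everything to a single floating-point type before further processing. The Python sketch below shows the idea only; `coerce_to_float` is a hypothetical helper and does not describe how Avro2TF treats such fields.

```python
from typing import Iterable, List, Union

Number = Union[int, float]


def coerce_to_float(values: Iterable[Number]) -> List[float]:
    """Convert a mix of ints and floats into a uniform list of floats."""
    return [float(v) for v in values]


mixed = [1, 2.5, 3]          # some records are integers, some are doubles
uniform = coerce_to_float(mixed)
```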
Plus Jupyter notebook tutorial.
It would be great if TensorizeIn config supported HOCON as that would allow comments.
There appears to be an intermittent performance issue, where the "head at TensorMetadataGeneration.scala:113" job can take almost an hour in a single executor, while all other executors take less than 1-2 minutes.
Perhaps agg(Map[String, String]) is less efficient (or less Catalyst-optimizable) than agg(Column, Column*).
In looking at the diff in #39, I noticed that DeserializationFeature.ACCEPT_SINGLE_VALUE_AS_ARRAY was set to true, but, after #31 was merged, this effectively became false. This behavior wasn't documented or tested, but could be restored if desired. (I assume it was intended to be used with the labels field.)
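For context, Jackson's ACCEPT_SINGLE_VALUE_AS_ARRAY allows a single value to be accepted where an array is expected. The Python sketch below emulates that behavior for a labels-style field; `as_list` is a hypothetical helper, and this is not how the project's current parser works.

```python
from typing import Any, List


def as_list(value: Any) -> List[Any]:
    """Wrap a single value in a list; leave an existing list untouched."""
    return value if isinstance(value, list) else [value]


single_label = {"inputFeatureInfo": {"columnExpr": "response"}}

# Both spellings end up as a one-element list.
labels_from_single = as_list(single_label)
labels_from_list = as_list([single_label])
```

If the behavior were restored, a config with a single labels object and one with a one-element labels array would be treated the same way.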