Comments (3)
Can you share more info regarding your errors? Some more context is needed.
I ran this command with the default parameters in dedup_job.yaml on test_korean_jsonl_data:
python bin/sparkapp.py dedup_job --config_path=./configs/dedup_job.yaml
Missing Python executable 'C:\Users\master\anaconda3\envs\dps\python', defaulting to 'C:\Users\master\anaconda3\envs\dps\Lib\site-packages\pyspark\bin\..' for SPARK_HOME environment variable. Please install Python or specify the correct
Python executable in PYSPARK_DRIVER_PYTHON or PYSPARK_PYTHON environment variable to detect SPARK_HOME safely.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Traceback (most recent call last):
File "D:\UST\semester 3\Field Research\LLM\dps\bin\sparkapp.py", line 15, in <module>
dps.spark.run()
File "C:\Users\master\anaconda3\envs\dps\lib\site-packages\dps\spark\run.py", line 11, in run
fire.Fire(
File "C:\Users\master\anaconda3\envs\dps\lib\site-packages\fire\core.py", line 138, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "C:\Users\master\anaconda3\envs\dps\lib\site-packages\fire\core.py", line 463, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "C:\Users\master\anaconda3\envs\dps\lib\site-packages\fire\core.py", line 672, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "C:\Users\master\anaconda3\envs\dps\lib\site-packages\dps\spark\jobs\dedup_job.py", line 80, in dedup_job
.reduceByKey(lambda x, y: x + y)
File "C:\Users\master\anaconda3\envs\dps\lib\site-packages\pyspark\rdd.py", line 1893, in reduceByKey
return self.combineByKey(lambda x: x, func, func, numPartitions, partitionFunc)
File "C:\Users\master\anaconda3\envs\dps\lib\site-packages\pyspark\rdd.py", line 2138, in combineByKey
numPartitions = self._defaultReducePartitions()
File "C:\Users\master\anaconda3\envs\dps\lib\site-packages\pyspark\rdd.py", line 2583, in _defaultReducePartitions
return self.getNumPartitions()
File "C:\Users\master\anaconda3\envs\dps\lib\site-packages\pyspark\rdd.py", line 2937, in getNumPartitions
return self._prev_jrdd.partitions().size()
File "C:\Users\master\anaconda3\envs\dps\lib\site-packages\py4j\java_gateway.py", line 1321, in __call__
return_value = get_return_value(
File "C:\Users\master\anaconda3\envs\dps\lib\site-packages\pyspark\sql\utils.py", line 111, in deco
return f(*a, **kw)
File "C:\Users\master\anaconda3\envs\dps\lib\site-packages\py4j\protocol.py", line 326, in get_return_value
raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o52.partitions.
: java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$POSIX.stat(Ljava/lang/String;)Lorg/apache/hadoop/io/nativeio/NativeIO$POSIX$Stat;
at org.apache.hadoop.io.nativeio.NativeIO$POSIX.stat(Native Method)
at org.apache.hadoop.io.nativeio.NativeIO$POSIX.getStat(NativeIO.java:608)
at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfoByNativeIO(RawLocalFileSystem.java:934)
at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:848)
at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:816)
at org.apache.hadoop.fs.LocatedFileStatus.<init>(LocatedFileStatus.java:52)
at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:2199)
at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:2179)
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:244)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:332)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:205)
at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:55)
at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:101)
at org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:87)
at org.apache.spark.rdd.RDD.$anonfun$dependencies$2(RDD.scala:264)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.rdd.RDD.dependencies(RDD.scala:260)
at org.apache.spark.rdd.ShuffledRDD.getPreferredLocations(ShuffledRDD.scala:98)
at org.apache.spark.rdd.RDD.$anonfun$preferredLocations$2(RDD.scala:324)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.rdd.RDD.preferredLocations(RDD.scala:324)
at org.apache.spark.scheduler.DAGScheduler.getPreferredLocsInternal(DAGScheduler.scala:2529)
at org.apache.spark.scheduler.DAGScheduler.getPreferredLocs(DAGScheduler.scala:2503)
at org.apache.spark.SparkContext.getPreferredLocs(SparkContext.scala:1898)
at org.apache.spark.rdd.DefaultPartitionCoalescer$PartitionLocations.$anonfun$getAllPrefLocs$1(CoalescedRDD.scala:198)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
at org.apache.spark.rdd.DefaultPartitionCoalescer$PartitionLocations.getAllPrefLocs(CoalescedRDD.scala:197)
at org.apache.spark.rdd.DefaultPartitionCoalescer$PartitionLocations.<init>(CoalescedRDD.scala:190)
at org.apache.spark.rdd.DefaultPartitionCoalescer.coalesce(CoalescedRDD.scala:391)
at org.apache.spark.rdd.CoalescedRDD.getPartitions(CoalescedRDD.scala:90)
at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:55)
at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
at org.apache.spark.api.java.JavaRDDLike.partitions(JavaRDDLike.scala:61)
at org.apache.spark.api.java.JavaRDDLike.partitions$(JavaRDDLike.scala:61)
at org.apache.spark.api.java.AbstractJavaRDDLike.partitions(JavaRDDLike.scala:45)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.lang.Thread.run(Thread.java:748)
This is a Spark environment issue.
You need to first run a simple test to confirm that Spark works in your local environment.
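As a rough sketch of such a test (assuming PySpark is installed in the dps conda environment; all paths below are hypothetical examples for a Windows setup), the following checks whether a bare local SparkContext and a reduceByKey work at all before debugging the dedup job. The UnsatisfiedLinkError from NativeIO$POSIX.stat in the traceback is a common symptom on Windows of a missing or mismatched winutils.exe / hadoop.dll, so HADOOP_HOME usually needs to point at a matching Hadoop build, and the SPARK_HOME warning suggests setting PYSPARK_PYTHON explicitly.

# Minimal local Spark smoke test (sketch). Adjust the example paths to your machine.
import os

# Point PySpark at the interpreter of the active conda env, as the
# SPARK_HOME warning in the log suggests (example path).
os.environ.setdefault("PYSPARK_PYTHON", r"C:\Users\master\anaconda3\envs\dps\python.exe")
os.environ.setdefault("PYSPARK_DRIVER_PYTHON", r"C:\Users\master\anaconda3\envs\dps\python.exe")

# On Windows, the NativeIO UnsatisfiedLinkError usually means winutils.exe /
# hadoop.dll are missing or do not match the Hadoop version bundled with
# Spark. HADOOP_HOME\bin must contain those binaries (hypothetical location).
os.environ.setdefault("HADOOP_HOME", r"C:\hadoop")

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").appName("smoke-test").getOrCreate()
sc = spark.sparkContext

# Exercise the same kind of RDD operation the dedup job fails on
# (reduceByKey) against a trivial in-memory dataset.
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
print(pairs.reduceByKey(lambda x, y: x + y).collect())  # expect [('a', 4), ('b', 2)]

spark.stop()

If this minimal job already fails with the same NativeIO error, the problem is the local Hadoop/winutils setup rather than anything in dps.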
Related Issues (20)
- Add normalize `?,:"!` in common preprocess job
- Update additional preprocess function HOT 1
- Remove `soynlp` library
- Add pre-processing for Japanese texts
- Replace html2text from Beautifulsoup HOT 1
- Task consideration HOT 3
- Implement minhash dedup module
- [ja] replace Japanese PII
- [ja] reduce emoticon HOT 1
- [ja] spam word filter
- Japanese pre-processing - remove text with low rate of Japanese stopwords HOT 4
- Improve Korean preprocessing algorithm
- Need to add ignore null or empty text during korean text process
- Refactor RDD process to Dataframe process
- [ja] refactor MinHashLSH-based near deduplication method
- Chinese dedup memory error HOT 1
- [ja] `.filter` is used instead of `.map` for non-filter methods HOT 1
- Bug in the function `remove_repeated_text`
- k