
Comments (13)

commented on August 23, 2024

I also tried to run it manually using Google Cloud Storage as my file system, but it's not supported. Here is the attached screenshot of the output:
[screenshot: capture_log]


commented on August 23, 2024

Here is a link I found which may help; it appears to be the same problem I am hitting:

http://stackoverflow.com/questions/26450708/fail-to-run-spark-job-when-using-globstatus-and-google-cloud-storage-bucket-as-i


saurfang commented on August 23, 2024

Thanks for the report. Indeed, the splittable InputFormat was implemented against the Hadoop 2 API only, so it only runs on Hadoop 2.x. I might look into how difficult it is to make it Hadoop 1 compatible.

For your second question, I'm not familiar with Google Cloud Storage, but I find this information could be useful for your case: https://cloud.google.com/hadoop/google-cloud-storage-connector#configuring

It sounds like the Google Cloud core-site.xml is not on your classpath. If you are running with sbt, you can try placing this config file under src/main/resources; if you are running spark-shell with the --packages option, you can add its directory to the driver/executor extraClassPath (a sketch follows the link below).

https://spark.apache.org/docs/latest/configuration.html#runtime-environment
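
For example, a minimal sketch of the extraClassPath approach (the /path/to/gcs-conf directory is a placeholder for wherever your core-site.xml lives):

spark-shell --packages saurfang:spark-sas7bdat:1.0.0-s_2.10 \
  --conf spark.driver.extraClassPath=/path/to/gcs-conf \
  --conf spark.executor.extraClassPath=/path/to/gcs-conf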

Give this a try. If it doesn't work, I'll see if I can find some time this week to replicate this issue, since that's probably a lot easier than achieving Hadoop 1 compatibility.


commented on August 23, 2024

I have already set up the same core-site.xml file with the settings mentioned in the link you shared. When I do the following, it works fine and I am able to access the data from Google Cloud Storage:

val test = sc.textFile("gs://test-bucket/test-file/README.md")

But when I use the same path with the following command, I get the error:

val cars = sqlContext.sasFile("gs://test-bucket/test-file/test.sas7bdat")

Bottom line: I am able to access data on Google Cloud via "gs://" using other commands as above, but not when using sqlContext.sasFile.

So any help with this would be greatly appreciated.

I really appreciate you taking the time to answer my questions.


commented on August 23, 2024

Hi,
I played with it extensively today, converting the SAS files into CSV, but I was not successful and am not sure of the reason.
When I run exactly the same program you provided, the output is a folder ending in .csv containing a file called _SUCCESS (which has nothing in it).
I also gave it a shot converting into Parquet, but it creates a directory with that name containing two files: 1. _common_metadata (which has some data) and 2. _SUCCESS (which is empty).

Does it have minimum system requirements, like RAM size? I am on 4-core, 15 GB instances, with 1 master and 2 slaves.

I am attaching the screenshots below.

[screenshot]

[screenshot: output]

Then I converted it:
[screenshot]

When I checked the output I had this:
[screenshot]

Could you please help me in this regard?


saurfang commented on August 23, 2024

Thanks for providing this useful information. That's very odd. Since you have forked the repository, do you mind running sbt test in the terminal? Do the tests pass or fail?

I see that when you load the data into a DataFrame, it was able to infer the schema correctly. Can you try df.first and df.count? Do they give you the first row and the number of rows correctly?

If this is not confidential data, is it possible to give me a sample so I can replicate?


saurfang commented on August 23, 2024

I just pushed an update to a new branch (#2) where I made this compatible with Hadoop 1.
Do you know how to pull that branch and build the assembly jar so you can help me test it on the Databricks cloud? A rough sketch follows below.
I currently don't have access to a Hadoop 1 distro to test on and might get one up on EC2 during the weekend if needed (but at least all unit tests pass in local mode).
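
If it helps, pulling the PR branch and building would look something like this (the hadoop1 local branch name is just a placeholder, and sbt assembly assumes the repo's sbt-assembly setup):

git clone https://github.com/saurfang/spark-sas7bdat.git
cd spark-sas7bdat
git fetch origin pull/2/head:hadoop1
git checkout hadoop1
sbt assembly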

Although, given your unsuccessful attempts on this dataset, I'm a bit skeptical this change will actually solve your problem. Let me know!


commented on August 23, 2024

Here is the dataset I am using:

https://s3.amazonaws.com/mdpbi.testbucket/dsc_optima.sas7bdat

- I will try all the things you mentioned and update you with the results.
- I am not sure how to pull the branch, but I will work with the Databricks cloud team and give it a try.


saurfang commented on August 23, 2024

Thanks for sharing the data. Unfortunately, I wasn't able to reproduce your problem. Here is what I did:

spark-shell --master local --packages saurfang:spark-sas7bdat:1.0.0-s_2.10

brings me into the Spark shell

import com.github.saurfang.sas.spark._
val dsc_optima = sqlContext.sasFile("/Users/Forest/Downloads/dsc_optima.sas7bdat")
dsc_optima.first

prints the first row of the records:
[screenshot]
and

dsc_optima.count

returns 25000 correctly.

I also tried to save it as CSV with

dsc_optima.saveAsCsvFile("dsc_optima.csv")

That finished:
[screenshot]
and I have 5 part files, which upon inspection seem to contain correct data:
[screenshot]

Let me know if there is anything else I can help with. I will test the Hadoop 1 implementation this weekend on EC2, so hopefully that can get you rolling on the Databricks cloud. Is this the full dataset? If so, I think you can convert it to a csv/parquet file locally (parquet is better since it preserves the "missing" information). Are you using a Mac?


saurfang commented on August 23, 2024

Just pushed a new version, 1.1.0, which adds support for Hadoop 1. You can try it with

$SPARK_HOME/bin/spark-shell --packages saurfang:spark-sas7bdat:1.1.0-s_2.10

I tested this on EC2 with data uploaded to HDFS as well as on the local FS. I added a fix to pick the right filesystem, so in theory this should work for s3 or gs. However, I tried it with s3 and it failed; I will take another look this weekend. Meanwhile, if you'd like, you can try gs and see if that works for you. Otherwise, with this new release I hope you at least have more success running this with data on HDFS on any Hadoop 1 distribution.


saurfang commented on August 23, 2024

1.1.2 should now correctly read data from s3n, so you can do

val dsc_optima = sqlContext.sasFile("s3n://mdpbi.testbucket/dsc_optima.sas7bdat")

That is, if you have configured the AWS dependencies and AWS access keys correctly, of course. I tested it using the spark-ec2 script. A sketch of one way to set the keys follows below.
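
For reference, one common way to supply the s3n credentials from the Spark shell is through the Hadoop configuration (the key values are placeholders):

sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")
val dsc_optima = sqlContext.sasFile("s3n://mdpbi.testbucket/dsc_optima.sas7bdat")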


arungo13 commented on August 23, 2024

Hi,
This package is really amazing for reading a proprietary data format. My post is not directly related to this issue; I am just looking for your direction on versions. I have a similar scenario, i.e. parsing and converting SAS7BDAT files, and the spark-sas7bdat package looks promising, but I have another challenge in terms of version compatibility. We are using CDH 5.3 (Cloudera), which has Spark 1.2. Can we import and distribute the relevant Spark packages (like the .jar) into our Hadoop platform, rather than upgrading Spark to 1.3 (as we have other dependencies)? Will it work? Please help me.


saurfang commented on August 23, 2024

@arungo13 Thanks for your interest. Unfortunately, DataFrames were only formally introduced in Spark 1.3. That said, Spark, being a YARN application, is pretty easy to upgrade since it has no cluster-configuration component. Furthermore, you can always convert your sas7bdat to parquet or csv using a Spark 1.3 spark-shell and then work with Spark 1.2 instead; a sketch of that conversion follows below.
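
For example, a minimal sketch of the conversion in a Spark 1.3 spark-shell (paths are placeholders; saveAsParquetFile is the Spark 1.3 DataFrame API):

import com.github.saurfang.sas.spark._
val df = sqlContext.sasFile("/path/to/data.sas7bdat")
df.saveAsParquetFile("/path/to/data.parquet")
// or df.saveAsCsvFile("/path/to/data.csv"), as shown earlier in this thread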

That being said, you can still use the underlying Hadoop input format on Spark 1.2. See here for how to read it as an RDD[Array[Object]]: https://github.com/saurfang/spark-sas7bdat/blob/master/src/main/scala/com/github/saurfang/sas/spark/SasRelation.scala#L39
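
Roughly along these lines (the SasInputFormat package path here is an assumption; check the linked SasRelation.scala for the exact imports):

import org.apache.hadoop.io.NullWritable
import com.github.saurfang.sas.mapred.SasInputFormat // assumed location

// Read each record as an Array[Object], dropping the NullWritable keys.
val rows = sc.hadoopFile(
  "/path/to/data.sas7bdat",
  classOf[SasInputFormat],
  classOf[NullWritable],
  classOf[Array[Object]]
).map(_._2)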

Unfortunately, you will have to identify the schema and figure out how to convert it to a format you can work with yourself. If you have follow-up questions, feel free to open a new issue, since yours isn't directly related to this one.

