
Comments (4)

gforsyth commented on June 10, 2024

Hey @mark-druffel -- thanks for the report! This should be fixed on main if you want to try out a pre-release. We're no longer sanitizing the inputs to format, so if the underlying Spark version supports writing to Delta, it should "just work", since create_table calls saveAsTable under the hood.
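
For example, on the pre-release, something like this should work (the source path and table name here are illustrative, not from the original report):

import ibis

ispark = ibis.pyspark.connect(session=spark)  # spark: an existing SparkSession
df = ispark.read_parquet(source="abfss://some_container/some_data.parquet")  # illustrative path
ispark.create_table(name="my_delta_table", obj=df, format="delta", overwrite=True)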


gforsyth commented on June 10, 2024

Hey @mark-druffel -- can you open a separate issue for that? That is definitely an oversight, and we can add ('catalog', 'database') support to create_table, but it's helpful to have a separate issue to track it.
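
For reference, the call shape that support would enable (hypothetical, not implemented yet) would mirror drop_table:

# hypothetical: ('catalog', 'database') tuples are not yet accepted by create_table
ispark.create_table(
    name="raw_camp_info",
    obj=df,
    overwrite=True,
    format="delta",
    database=("comms_media_dev", "dart_extensions"),
)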


mark-druffel commented on June 10, 2024

@gforsyth Thanks, the pre-release works great! I do have one other question though 😬 Let me know if this deserves its own issue.

TL;DR

Is it intended that the database argument in create_table works differently from the one in drop_table? create_table only accepts a str, while drop_table also accepts a tuple.

If I set the catalog and database via pyspark, create_table works as expected, but I can't find a way to do that through create_table itself; I had to go through the pyspark session directly:

from pyspark.sql import SparkSession
import ibis

spark = SparkSession.builder.getOrCreate()
ispark = ibis.pyspark.connect(session=spark)

# point the underlying pyspark session at the right catalog/database
ispark._session.catalog.setCurrentCatalog("comms_media_dev")
ispark._session.catalog.setCurrentDatabase("dart_extensions")

# df is an ibis table built earlier, e.g. via ispark.read_parquet(...)
ispark.create_table(name="raw_camp_info", obj=df, overwrite=True, format="delta")

I can drop a table without accessing the pyspark session:

ispark.drop_table(name="raw_camp_info", database=tuple(["comms_media_dev", "dart_extensions"]))
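
A plain tuple literal is equivalent and slightly shorter:

ispark.drop_table(name="raw_camp_info", database=("comms_media_dev", "dart_extensions"))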

Additional Details

To drop my table I can just specify the catalog and database in my call:

from pyspark.sql import SparkSession
import ibis

spark = SparkSession.builder.getOrCreate()
ispark = ibis.pyspark.connect(session=spark)
ispark.drop_table(name="raw_camp_info", database=tuple(["comms_media_dev", "dart_extensions"]))

Trying the same approach with create_table fails; judging by the error, the tuple is passed straight through to setCurrentDatabase, which expects a string:

ispark.create_table(name="raw_camp_info", obj=df, overwrite=True, format="delta", database=tuple(["comms_media_dev", "dart_extensions"]))

py4j.Py4JException: Method setCurrentDatabase([class java.util.ArrayList]) does not exist

I also tried a dot-separated name:

ispark.create_table(name="raw_camp_info", obj=df, overwrite=True, format="delta", database="comms_media_dev.dart_extensions")

AnalysisException: too many name parts for schema name: comms_media_dev.dart_extensions

I then tried setting the catalog and passing just the database name, and got a permissions error. Looking through the error, it seems create_table didn't apply the database argument, since the error refers to the default database (i.e. comms_media_dev.default):

ispark._session.catalog.setCurrentCatalog("comms_media_dev")
ispark.create_table(name="raw_camp_info", obj=df, overwrite=True, format="delta", database="dart_extensions")

com.databricks.sql.managedcatalog.acl.UnauthorizedAccessException: PERMISSION_DENIED: User does not have USE SCHEMA on Schema 'comms_media_dev.default'.

Py4JJavaError Traceback (most recent call last)
File :10
8 df = ispark.read_parquet(source = "abfss://my_parquet")
9 df = df.select(_.CAMP_START_DATE, _.CAMP_END_DATE, _.KPM_PROJECT_ID)
---> 10 ispark.create_table(name = "raw_camp_info", obj = df, overwrite = True, format="delta", database="dart_extensions")

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-c1f2e765-cb45-4a96-a1dc-bd25ecd4f460/lib/python3.9/site-packages/ibis/backends/pyspark/__init__.py:453, in Backend.create_table(self, name, obj, schema, database, temp, overwrite, format)
451 self._run_pre_execute_hooks(table)
452 df = self._session.sql(query)
--> 453 df.write.saveAsTable(name, format=format, mode=mode)
454 elif schema is not None:
455 schema = PySparkSchema.from_ibis(schema)

File /usr/lib/python3.9/contextlib.py:124, in _GeneratorContextManager.__exit__(self, type, value, traceback)
122 if type is None:
123 try:
--> 124 next(self.gen)
125 except StopIteration:
126 return False

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-c1f2e765-cb45-4a96-a1dc-bd25ecd4f460/lib/python3.9/site-packages/ibis/backends/pyspark/__init__.py:234, in Backend._active_database(self, name)
232 yield
233 finally:
--> 234 self._session.catalog.setCurrentDatabase(current)

File /databricks/spark/python/pyspark/instrumentation_utils.py:48, in _wrap_function.<locals>.wrapper(*args, **kwargs)
46 start = time.perf_counter()
47 try:
---> 48 res = func(*args, **kwargs)
49 logger.log_success(
50 module_name, class_name, function_name, time.perf_counter() - start, signature
51 )
52 return res

File /databricks/spark/python/pyspark/sql/catalog.py:185, in Catalog.setCurrentDatabase(self, dbName)
172 def setCurrentDatabase(self, dbName: str) -> None:
173 """
174 Sets the current default database in this session.
175
(...)
183 >>> spark.catalog.setCurrentDatabase("default")
184 """
--> 185 return self._jcatalog.setCurrentDatabase(dbName)

File /databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py:1321, in JavaMember.__call__(self, *args)
1315 command = proto.CALL_COMMAND_NAME +\
1316     self.command_header +\
1317     args_command +\
1318     proto.END_COMMAND_PART
1320 answer = self.gateway_client.send_command(command)
-> 1321 return_value = get_return_value(
1322 answer, self.gateway_client, self.target_id, self.name)
1324 for temp_arg in temp_args:
1325 temp_arg._detach()

File /databricks/spark/python/pyspark/errors/exceptions.py:228, in capture_sql_exception.<locals>.deco(*a, **kw)
226 def deco(*a: Any, **kw: Any) -> Any:
227 try:
--> 228 return f(*a, **kw)
229 except Py4JJavaError as e:
230 converted = convert_exception(e.java_exception)

File /databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py:326, in get_return_value(answer, gateway_client, target_id, name)
324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
325 if answer[1] == REFERENCE_TYPE:
--> 326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
331 "An error occurred while calling {0}{1}{2}. Trace:\n{3}\n".
332 format(target_id, ".", name, value))

Py4JJavaError: An error occurred while calling o569.setCurrentDatabase.
: com.databricks.sql.managedcatalog.acl.UnauthorizedAccessException: PERMISSION_DENIED: User does not have USE SCHEMA on Schema 'comms_media_dev.default'.
at com.databricks.managedcatalog.UCReliableHttpClient.reliablyAndTranslateExceptions(UCReliableHttpClient.scala:84)
at com.databricks.managedcatalog.UCReliableHttpClient.get(UCReliableHttpClient.scala:136)
at com.databricks.managedcatalog.ManagedCatalogClientImpl.$anonfun$getSchema$1(ManagedCatalogClientImpl.scala:378)
at com.databricks.managedcatalog.ManagedCatalogClientImpl.$anonfun$recordAndWrapException$2(ManagedCatalogClientImpl.scala:3401)
at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:80)
at com.databricks.managedcatalog.ManagedCatalogClientImpl.$anonfun$recordAndWrapException$1(ManagedCatalogClientImpl.scala:3400)
at com.databricks.managedcatalog.ErrorDetailsHandler.wrapServiceException(ErrorDetailsHandler.scala:25)
at com.databricks.managedcatalog.ErrorDetailsHandler.wrapServiceException$(ErrorDetailsHandler.scala:23)
at com.databricks.managedcatalog.ManagedCatalogClientImpl.wrapServiceException(ManagedCatalogClientImpl.scala:106)
at com.databricks.managedcatalog.ManagedCatalogClientImpl.recordAndWrapException(ManagedCatalogClientImpl.scala:3397)
at com.databricks.managedcatalog.ManagedCatalogClientImpl.getSchema(ManagedCatalogClientImpl.scala:374)
at com.databricks.sql.managedcatalog.ManagedCatalogCommon.$anonfun$getSchemaMetadata$5(ManagedCatalogCommon.scala:227)
at scala.Option.getOrElse(Option.scala:189)
at com.databricks.sql.managedcatalog.ManagedCatalogCommon.getSchemaMetadata(ManagedCatalogCommon.scala:227)
at com.databricks.sql.managedcatalog.ManagedCatalogCommon.schemaExists(ManagedCatalogCommon.scala:233)
at com.databricks.sql.managedcatalog.ManagedCatalogSessionCatalog.databaseExists(ManagedCatalogSessionCatalog.scala:631)
at com.databricks.sql.managedcatalog.ManagedCatalogSessionCatalog.requireScExists(ManagedCatalogSessionCatalog.scala:274)
at com.databricks.sql.managedcatalog.ManagedCatalogSessionCatalog.setCurrentDatabase(ManagedCatalogSessionCatalog.scala:492)
at com.databricks.sql.DatabricksCatalogManager.setCurrentNamespace(DatabricksCatalogManager.scala:159)
at org.apache.spark.sql.internal.CatalogImpl.setCurrentDatabase(CatalogImpl.scala:98)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
at py4j.Gateway.invoke(Gateway.java:306)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:195)
at py4j.ClientServerConnection.run(ClientServerConnection.java:115)
at java.lang.Thread.run(Thread.java:750)
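
For what it's worth, the frames above suggest the failure happens in the restore step of ibis's _active_database context manager. A rough reconstruction from the traceback (not the actual source):

from contextlib import contextmanager

@contextmanager
def _active_database(self, name):
    # reconstructed from the traceback; the real implementation may differ
    current = self._session.catalog.currentDatabase()
    try:
        self._session.catalog.setCurrentDatabase(name)
        yield
    finally:
        # this is the call that raises: restoring the previous database
        # ('default' in the new catalog) requires USE SCHEMA on it
        self._session.catalog.setCurrentDatabase(current)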

If I set the catalog and database via pyspark, create_table works as expected:

ispark._session.catalog.setCurrentCatalog("comms_media_dev")
ispark._session.catalog.setCurrentDatabase("dart_extensions")
ispark.create_table(name="raw_camp_info", obj=df, overwrite=True, format="delta")
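
Since changing the current catalog/database mutates state for everything sharing the Spark session, here is a sketch of a save-and-restore wrapper around that workaround (the helper name is my own; Catalog.currentCatalog requires Spark 3.4+):

from contextlib import contextmanager

@contextmanager
def use_namespace(session, catalog, database):
    # temporarily point the session at catalog.database, then restore
    prev_catalog = session.catalog.currentCatalog()
    prev_database = session.catalog.currentDatabase()
    session.catalog.setCurrentCatalog(catalog)
    session.catalog.setCurrentDatabase(database)
    try:
        yield
    finally:
        session.catalog.setCurrentCatalog(prev_catalog)
        session.catalog.setCurrentDatabase(prev_database)

with use_namespace(ispark._session, "comms_media_dev", "dart_extensions"):
    ispark.create_table(name="raw_camp_info", obj=df, overwrite=True, format="delta")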


mark-druffel commented on June 10, 2024

Thanks for all your help @gforsyth!

