
migrate's People

Contributors

adiospeds, ahoughro, aly76, andre-sasaki, andrepibo-db, boydzd, carrossoni, devninja5, geeksheikh, gogi2811, gregwood-db, hwang-db, kevinkim9264, koernigo, kr-arjun, lorenzorubi-db, lverzosa, lyliyu, marekmodry, mkazia-db, mrchristine, nfx, nmoorenz, paulg-mscience, ravirahul-db, sergiu-databricks, souravkhandelwal09, stikkireddy, veenaramesh, yubing


migrate's Issues

[import] --users: cannot import admin group

Import cannot re-create admins group:

Creating group: admins
Error: b'{"schemas":["urn:ietf:params:scim:api:messages:2.0:Error"],"detail":"Group with name admins already exists.","status":"409"}\n'

Therefore, the admins group should not be exported.

Consider pulling the configure command into this migration CLI

Having to switch between CLIs just to configure tokens should not be necessary.

Since the tool is integrated with click, we can simply add:

from databricks_cli.configure.cli import configure_cli
cli.add_command(configure_cli, name='configure')

This pulls the configure command into this CLI while remaining compatible with the databricks CLI at the same time.

Let me know if anyone has concerns about both tools having the configure command.
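
A minimal sketch of the wiring described above, assuming the top-level entry point is a click group (the group name and module layout here are illustrative, not the tool's actual code):

import click
from databricks_cli.configure.cli import configure_cli  # command quoted above

@click.group()
def cli():
    """databricks-migrate entry point (hypothetical top-level group)."""
    pass

# Reuse the configure command shipped with databricks-cli so users never have
# to switch CLIs just to set up host/token profiles.
cli.add_command(configure_cli, name='configure')

if __name__ == '__main__':
    cli()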

[import] --clusters: got an unexpected keyword argument 'instance_pool_id'

Cannot import clusters after export --clusters:

➜  migrate-sample databricks-migrate import --profile sandbox  --clusters
Executing action: 'start importing databricks objects' at time: 2020-10-02 11:23:59.949141
Executing action: 'import instance profiles' at time: 2020-10-02 11:23:59.951035
Duration to execute function 'import instance profiles': 0:00:00.426867
Executing action: 'import instance pools' at time: 2020-10-02 11:24:00.378298
Error: TypeError: create_instance_pool() got an unexpected keyword argument 'instance_pool_id'
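
A minimal sketch of one possible fix (an assumption, not the tool's actual code): strip read-only fields from the exported pool config before calling the create API.

import json

# Fields returned by the list/get API that the create call does not accept.
# The exact set is an assumption; 'instance_pool_id' is the one from the error above.
READ_ONLY_FIELDS = {'instance_pool_id', 'state', 'stats', 'status', 'default_tags'}

def scrub_pool_config(exported_json_line):
    conf = json.loads(exported_json_line)
    return {k: v for k, v in conf.items() if k not in READ_ONLY_FIELDS}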

Support a --overwrite option for --import-home

The Databricks workspace REST API supports an overwrite option for workspace artifacts.

However, the arguments passed when importing objects from a prior export are the defaults, so the overwrite flag stays false.

An --overwrite option would be quite beneficial here.
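
A minimal sketch of what the flag could look like (an assumption; client.post and the helper name are hypothetical). The workspace import API itself accepts an overwrite field in its JSON body:

def import_workspace_object(client, path, content_b64, fmt='DBC', overwrite=False):
    payload = {
        'path': path,
        'format': fmt,
        'content': content_b64,
        'overwrite': overwrite,   # stays False today; --overwrite would flip it to True
    }
    return client.post('/workspace/import', payload)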

Import metastore - NameError: name 're' is not defined

It looks like import re needs to be added to HiveClient.py:

File "/home/hadoop/databricks/migrate/dbclient/HiveClient.py", line 48, in get_path_option_if_available
params = re.search(r'((.*?))', stmt).group(1)
NameError: name 're' is not defined

[import] --workspace: cannot import notebooks after export --workspace --download

Error output:

➜  databricks-migrate export --profile sandbox  --workspace --download
Executing action: 'start exporting databricks objects' at time: 2020-10-02 11:16:41.704741
Executing action: 'export workspace object configs' at time: 2020-10-02 11:16:41.706632
Duration to execute function 'export workspace object configs': 0:00:33.061810
Executing action: 'download workspace objects' at time: 2020-10-02 11:17:14.769287
Duration to execute function 'download workspace objects': 0:00:00.971945
Duration to execute function 'start exporting databricks objects': 0:00:34.038299

➜  databricks-migrate import --profile sandbox-e2  --workspace        
Executing action: 'start importing databricks objects' at time: 2020-10-02 11:17:33.095767
Executing action: 'import all the workspace objects' at time: 2020-10-02 11:17:33.097597
Uploading: Test.dbc
Uploading: r.dbc
Uploading: python.dbc
Uploading: scala.dbc
Uploading: sql.dbc
Error: b'{"error_code":"RESOURCE_DOES_NOT_EXIST","message":"Path (/Users/[email protected]) doesn\'t exist."}'

Metastore Export does not respect column constraints.

Issue description

When exporting databases from the metastore, the table definitions generated do not have constraints such as 'is not null' defined for columns that have them in the original.

Steps to reproduce the issue

  1. In Workspace A define a table with a column constraint, such as 'is not null'.
  2. Export the database from workspace A.
  3. Attempt to import the database to Workspace B.

What's the expected result?

  • The database and table will be created.

What's the actual result?

  • An error explaining that the expected schema does not match the schema that already exists, because the existing schema has the constraint and the target schema does not.

Additional details / screenshot

  • Two screenshots attached (2021-08-10, 4:30 PM and 4:21 PM)

Metastore Views May Fail if Tables Do Not Exist

When applying metastore DDL, view creation can fail because the underlying table does not exist yet.

We should follow an order of operation:

  1. Create all table DDLs
  2. Create all views

Workaround: Users can work around this issue by running the import twice.
First pass will create the tables, second pass will skip failed tables and be able to create the views.
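
A minimal sketch of the ordering described above (an assumption about how the DDL replay could be structured, with a naive view-detection heuristic and a live spark session assumed):

def is_view_ddl(stmt):
    head = stmt.lstrip().upper()
    return head.startswith('CREATE VIEW') or head.startswith('CREATE OR REPLACE VIEW')

def apply_metastore_ddls(spark, ddl_statements):
    tables = [s for s in ddl_statements if not is_view_ddl(s)]
    views = [s for s in ddl_statements if is_view_ddl(s)]
    for stmt in tables + views:   # all tables first, then all views
        spark.sql(stmt)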

Namespace Error

Hi, we had an Azure DevOps pipeline working fine, but recently our migration pipeline started failing with the following error when exporting metadata. It uses a cluster with the 7.3 LTS runtime. The pipeline has worked successfully before and also works using the templated cluster.

Azure DevOps error:

ERROR:
AttributeError: namespace
{"resultType": "error", "summary": "<span class="ansi-red-fg">AttributeError: namespace", "cause": "---------------------------------------------------------------------------\nValueError Traceback (most recent call last)\n/databricks/spark/python/pyspark/sql/types.py in getattr(self, item)\n 1594 # but this will not be used in normal cases\n-> 1595 idx = self.fields.index(item)\n 1596 return self[idx]\n\nValueError: 'namespace' is not in list\n\nDuring handling of the above exception, another exception occurred:\n\nAttributeError Traceback (most recent call last)\n in \n----> 1 all_dbs = [x.namespace for x in spark.sql("show databases").collect()]; print(len(all_dbs))\n\n in (.0)\n----> 1 all_dbs = [x.namespace for x in spark.sql("show databases").collect()]; print(len(all_dbs))\n\n/databricks/spark/python/pyspark/sql/types.py in getattr(self, item)\n 1598 raise AttributeError(item)\n 1599 except ValueError:\n-> 1600 raise AttributeError(item)\n 1601 \n 1602 def setattr(self, key, value):\n\nAttributeError: namespace"}
2021-03-15T16:33:09.1574639Z

----------------------------- Driver Log ---------------------------------------------

ValueError Traceback (most recent call last)
/databricks/spark/python/pyspark/sql/types.py in __getattr__(self, item)
1594 # but this will not be used in normal cases
-> 1595 idx = self.fields.index(item)
1596 return self[idx]

ValueError: 'namespace' is not in list

During handling of the above exception, another exception occurred:

AttributeError Traceback (most recent call last)
in
----> 1 all_dbs = [x.namespace for x in spark.sql("show databases").collect()]; print(len(all_dbs))

in (.0)
----> 1 all_dbs = [x.namespace for x in spark.sql("show databases").collect()]; print(len(all_dbs))

/databricks/spark/python/pyspark/sql/types.py in __getattr__(self, item)
1598 raise AttributeError(item)
1599 except ValueError:
-> 1600 raise AttributeError(item)
1601
1602 def __setattr__(self, key, value):

AttributeError: namespace
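
The column returned by SHOW DATABASES is named differently across Spark/DBR versions (databaseName vs namespace), which matches the AttributeError above. A minimal sketch of a version-tolerant lookup (an assumption, with a live spark session assumed):

rows = spark.sql("show databases").collect()
# Row raises AttributeError for missing fields, so getattr with a default lets
# the same code work whichever column name the runtime returns.
all_dbs = [getattr(r, 'namespace', None) or getattr(r, 'databaseName', None) for r in rows]
print(len(all_dbs))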

trailing space in notebook name causes download failure

I ran export_db.py --workspace with no problem, but export_db.py --download hits a snag:

. . .
Downloading: /Users/[email protected]/Completed Work/Dave Testing/Threading testing
Get: https://bethesda2.cloud.databricks.com/api/2.0/workspace/export
Traceback (most recent call last):
  File "./migrate/export_db.py", line 297, in <module>
    main()
  File "./migrate/export_db.py", line 81, in main
    num_notebooks = ws_c.download_notebooks()
  File "/Users/neil.best/Documents/GitHub/BethesdaNet/zmi-bi-databricks/migrate/dbclient/WorkspaceClient.py", line 229, in download_notebooks
    dl_resp = self.download_notebook_helper(notebook_path, export_dir=self.get_export_dir() + ws_dir)
  File "/Users/neil.best/Documents/GitHub/BethesdaNet/zmi-bi-databricks/migrate/dbclient/WorkspaceClient.py", line 256, in download_notebook_helper
    save_filename = save_path + os.path.basename(notebook_path) + '.' + resp.get('file_type')
TypeError: must be str, not NoneType

That last notebook name had a trailing space:
'.../Threading testing ' 👈

The workaround was to fix it at the source and in the migrate log.
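
A minimal defensive sketch (an assumption about a possible fix; the helper name is hypothetical): skip the notebook instead of crashing when the export response carries no file_type, as happens here for the path with a trailing space.

import os

def save_filename_for(notebook_path, resp, save_path):
    file_type = resp.get('file_type')
    if not file_type:
        print('Skipping notebook with no file_type in export response: %r' % notebook_path)
        return None
    return save_path + os.path.basename(notebook_path) + '.' + file_type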

[import] --users: run should not be halted on instance profile errors

Whenever an instance profile cannot be imported, the application should log a warning instead of stopping completely, so that the operation can resume with the later steps.

➜  migrate-sample databricks-migrate import --profile sandbox-e2 --users
Executing action: 'start importing databricks objects' at time: 2020-10-02 10:56:43.340801
Executing action: 'import instance profiles' at time: 2020-10-02 10:56:43.343152
Importing arn: arn:aws:iam::xxx:instance-profile/yyy

Error: b'{"error_code":"DRY_RUN_FAILED","message":"Verification of the instance profile failed. AWS error: Value (arn:aws:iam::xxx:instance-profile/yyy) for parameter iamInstanceProfile.arn is invalid. Invalid IAM Instance Profile ARN"}'

Support migration of multi-task jobs

While doing the Jobs import (python export_db.py --profile DEMO --jobs -> python import_db.py --profile NEW-DEMO --jobs) I ran into:

Traceback (most recent call last):
  File "import_db.py", line 219, in <module>
    main()
  File "import_db.py", line 105, in main
    jobs_c.import_job_configs()
  File "/home/terry/databricks/migrate/dbclient/JobsClient.py", line 110, in import_job_configs
    cluster_conf = job_settings['new_cluster']
KeyError: 'new_cluster'

Looking at the code, it seems that it parses the JSON from the jobs.log file and expects there to be a 'new_cluster' key, which for MULTI_TASK may not be the case.

For reference, here's what my multi-task job's JSON looked like:

  {
    "job_id": 1234,
    "settings": {
      "name": "Example:::1234",
      "email_notifications": {
        "on_failure": ["[email protected]"],
        "no_alert_for_skipped_runs": true
      },
      "timeout_seconds": 0,
      "schedule": {
        "quartz_cron_expression": "51 30 23 * * ?",
        "timezone_id": "Australia/Brisbane",
        "pause_status": "UNPAUSED"
      },
      "max_concurrent_runs": 1,
      "format": "MULTI_TASK"
    },
    "created_time": 1628241879369,
    "creator_user_name": "[email protected]"
  }
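
A minimal sketch of a guard for this case (an assumption, not the actual JobsClient fix): multi-task jobs carry no top-level new_cluster, so the importer should not assume the key exists.

def job_level_cluster(job_settings):
    """Return the job-level cluster config, or None for multi-task jobs."""
    if job_settings.get('format') == 'MULTI_TASK':
        # Multi-task jobs define clusters per task (or via job_clusters),
        # so there is nothing to read at the job level.
        return None
    return job_settings.get('new_cluster')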

secret scope names containing slashes cause failure

databricks secrets list-scopes --profile bethesda2

returns:

Scope                                          Backend     KeyVault URL
---------------------------------------------  ----------  --------------
adaptive-planning-scope                        DATABRICKS  N/A
BI/Databricks/ForumUsers                       DATABRICKS  N/A
. . .

Secrets migration:

python ./migrate/export_db.py --secrets \                                                                                       (zmi-bi-databricks)
             --profile bethesda2 \
             --cluster-name "E2 migration" \
             --set-export-dir workflow/results/migrate/bethesda2/secrets

dies like this:

Get: https://bethesda2.cloud.databricks.com/api/2.0/secrets/list
Traceback (most recent call last):
  File "./migrate/export_db.py", line 297, in <module>
    main()
  File "./migrate/export_db.py", line 183, in main
    sc.log_all_secrets(args.cluster_name)
  File "/Users/neil.best/Documents/GitHub/BethesdaNet/zmi-bi-databricks/migrate/dbclient/SecretsClient.py", line 46, in log_all_secrets
    with open(scopes_logfile, 'w') as fp:
FileNotFoundError: [Errno 2] No such file or directory: 'workflow/results/migrate/bethesda2/secrets/secret_scopes/BI/Databricks/ForumUsers'

Export directory contains only:

find workflow/results/migrate/bethesda2/secrets -type f
workflow/results/migrate/bethesda2/secrets/secret_scopes/adaptive-planning-scope

The first scope succeeded but it looks like those slashes in the scope name need to be escaped or translated somehow for it to proceed.
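
A minimal sketch of one way to handle this (an assumption): sanitize the scope name before using it as a file name, since the '/' characters are otherwise interpreted as nested directories.

import os

def scope_log_path(export_dir, scope_name):
    safe_name = scope_name.replace('/', '__')   # BI/Databricks/ForumUsers -> BI__Databricks__ForumUsers
    path = os.path.join(export_dir, 'secret_scopes', safe_name)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    return path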

While importing clusters from the source to the target Databricks workspace, I am getting an ACL KeyError

I am migrating clusters from one Standard-tier Azure Databricks workspace to another and am getting the error below:

Traceback (most recent call last):
  File "import_db.py", line 219, in <module>
    main()
  File "import_db.py", line 97, in main
    cl_c.import_cluster_configs()
  File "/home//migrate/dbclient/ClustersClient.py", line 269, in import_cluster_configs
    acl_args = {'access_control_list' : self.build_acl_args(data['access_control_list'])}
KeyError: 'access_control_list'

ACL_clusters.log file showed the following error:
{"error_code": "INTERNAL_ERROR", "message": "FEATURE_DISABLED: ACLs for cluster are disabled or not available in this tier", "http_status_code": 500}

Is there a way to import without ACL?
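
A minimal sketch of how the import could skip missing ACLs (an assumption about the surrounding ClustersClient code, written as a standalone helper):

def cluster_acl_args(data):
    """Return ACL kwargs for the permissions call, or None if no ACLs were exported."""
    acl = data.get('access_control_list')
    if not acl:
        # e.g. FEATURE_DISABLED: ACLs not available in the source workspace tier,
        # so import the cluster without permissions instead of raising a KeyError.
        return None
    return {'access_control_list': acl}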

Issues using the tool for GCP imports

Got some updates from the person leading the Databricks-GCP migration efforts:

1. Users/Groups - successfully migrated, but the command failed at the end:

Importing group admins :
Traceback (most recent call last):
  File "import_db.py", line 219, in <module>
    main()
  File "import_db.py", line 41, in main
    scim_c.import_all_users_and_groups()
  File "/Users/pnaik/Source/Databricks_Migration/migrate/dbclient/ScimClient.py", line 437, in import_all_users_and_groups
    self.import_groups(group_dir)
  File "/Users/pnaik/Source/Databricks_Migration/migrate/dbclient/ScimClient.py", line 403, in import_groups
    old_email = old_user_emails[m['value']]
KeyError: '100000'

2. Notebook - download gave the below error:

Downloading: /Conviva3D/DataManagement/BugsAndProblems/B&P-01-20160628-Cannot-Save-PbRl.scala
Get: https://conviva.cloud.databricks.com/api/2.0/workspace/export
Downloading: /Conviva3D/DataManagement/BugsAndProblems/Clone of B&P-01-20160628-Cannot-Save-PbRl.scala
Get: https://conviva.cloud.databricks.com/api/2.0/workspace/export
Traceback (most recent call last):
  File "export_db.py", line 297, in <module>
    main()
  File "export_db.py", line 81, in main
    num_notebooks = ws_c.download_notebooks()
  File "/Users/pnaik/Source/Databricks_Migration/migrate/dbclient/WorkspaceClient.py", line 229, in download_notebooks
    dl_resp = self.download_notebook_helper(notebook_path, export_dir=self.get_export_dir() + ws_dir)
  File "/Users/pnaik/Source/Databricks_Migration/migrate/dbclient/WorkspaceClient.py", line 256, in download_notebook_helper
    save_filename = save_path + os.path.basename(notebook_path) + '.' + resp.get('file_type')
TypeError: can only concatenate str (not "NoneType") to str

Looks like the API response parsing needs tuning in https://github.com/databrickslabs/migrate/blob/master/dbclient/WorkspaceClient.py @mwc

PAT token is logged in clear text

Credentials must not be logged in clear text; otherwise it's a security risk.

databricks-migrate test_connection --profile sandbox
Testing connection at https://XXX/api/2.0 with headers: {'Authorization': 'Bearer XXX', 'Content-Type': 'text/json', 'user-agent': 'databricks-cli-0.11.0-test_connection-e372c092-0488-11eb-8cb1-a683e7c44a96'}
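
A minimal sketch of the redaction (an assumption about where it would be applied): scrub the Authorization header before any request logging.

def redact_headers(headers):
    safe = dict(headers)
    if 'Authorization' in safe:
        safe['Authorization'] = 'Bearer <REDACTED>'   # never log the PAT itself
    return safe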

Delta Table Metastore Entries Fail

Delta tables will fail to import due to TBLPROPERTIES and OPTIONS keywords that are not required during replay of the table DDL.

The fix will strip these properties during the import without affecting the export data.
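
A minimal sketch of the stripping (an assumption about the implementation; the regexes are naive and assume no ')' inside the clause values):

import re

def strip_delta_clauses(ddl):
    ddl = re.sub(r"\s*TBLPROPERTIES\s*\([^)]*\)", '', ddl, flags=re.IGNORECASE | re.DOTALL)
    ddl = re.sub(r"\s*OPTIONS\s*\([^)]*\)", '', ddl, flags=re.IGNORECASE | re.DOTALL)
    return ddl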

[import] --skip-failed does not skip resources that already exist

databricks-migrate import --profile sandbox --clusters --libs --skip-failed --workspace       
Executing action: 'start importing databricks objects' at time: 2020-10-02 11:38:19.551181
Executing action: 'import all the workspace objects' at time: 2020-10-02 11:38:19.554536
Uploading: Test.dbc
Error: b'{"error_code":"RESOURCE_ALREADY_EXISTS","message":"Path (/Test) already exists."}'

Fix migration of views dependent on other views

We have done most of the migration and it was quite smooth, but the migration of the metastore is giving us trouble. Most of the tables were migrated properly, but views are not migrated when a view depends on another view. Below is an example error:
Importing view XYZdatalakeproduction.last_moe_XYZ_assignment_submissions_current_academic_year_view - Not found
ERROR:

AnalysisException: View `XYZdatalakeproduction`.`last_XYZ_assignments_submissions_view` already exists. If you want to update the view definition, please use ALTER VIEW AS or CREATE OR REPLACE VIEW AS
{'resultType': 'error', 'summary': '<span class="ansi-red-fg">AnalysisException</span>: View `XYZdatalakeproduction`.`last_XYZ_assignments_submissions_view` already exists. If you want to update the view definition, please use ALTER VIEW AS or CREATE OR REPLACE VIEW AS', 'cause': '---------------------------------------------------------------------------\nAnalysisException                         Traceback (most recent call last)\n<command--1> in <module>\n----> 1 with open("/dbfs/tmp/migration/tmp_import_ddl.txt", "r") as fp: tmp_ddl = fp.read(); spark.sql(tmp_ddl)\n\n/databricks/spark/python/pyspark/sql/session.py in sql(self, sqlQuery)\n    775         [Row(f1=1, f2=\'row1\'), Row(f1=2, f2=\'row2\'), Row(f1=3, f2=\'row3\')]\n    776         """\n--> 777         return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)\n    778 \n    779     def table(self, tableName):\n\n/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)\n   1302 \n   1303         answer = self.gateway_client.send_command(command)\n-> 1304         return_value = get_return_value(\n   1305             answer, self.gateway_client, self.target_id, self.name)\n   1306 \n\n/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)\n    121                 # Hide where the exception came from that shows a non-Pythonic\n    122                 # JVM exception message.\n--> 123                 raise converted from None\n    124             else:\n    125                 raise\n\nAnalysisException: View `XYZdatalakeproduction`.`last_XYZ_assignments_submissions_view` already exists. If you want to update the view definition, please use ALTER VIEW AS or CREATE OR REPLACE VIEW AS'}
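
A minimal sketch of one way to handle view-on-view dependencies (an assumption, with a live spark session assumed): retry failed view DDLs in passes until no further progress is made.

def create_views_with_retries(spark, view_ddls, max_passes=5):
    pending = list(view_ddls)
    for _ in range(max_passes):
        before = len(pending)
        still_failing = []
        for ddl in pending:
            try:
                spark.sql(ddl)
            except Exception:          # e.g. a referenced view does not exist yet
                still_failing.append(ddl)
        pending = still_failing
        if not pending or len(pending) == before:   # done, or no progress made
            break
    return pending                      # anything left never succeeded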

Updated Readme

After using the Databricks migration tool, I thought I could help others by adding some additional information to the readme file.

Add Restart Option for Workspace Notebook Imports

Notebooks can be time-consuming to import, and the import restarts from the beginning in the event of a failure.

A better approach is to keep a log of successful imports and an index so that a rerun can start where the previous run left off.
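
A minimal sketch of the checkpoint idea (an assumption; the log location is hypothetical). On restart, any notebook whose path is already in the checkpoint set would be skipped.

import os

CHECKPOINT_FILE = 'logs/imported_notebooks.log'   # hypothetical location

def load_checkpoint():
    if not os.path.exists(CHECKPOINT_FILE):
        return set()
    with open(CHECKPOINT_FILE) as fp:
        return set(line.strip() for line in fp)

def mark_done(notebook_path):
    with open(CHECKPOINT_FILE, 'a') as fp:
        fp.write(notebook_path + '\n')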

pausing/unpausing jobs should consider exported state

Some job schedules are paused on purpose prior to migration and should remain as such. My reading of pause_all_jobs() here says that pause_status is toggled blindly according to only the option given and the current state in the --profile workspace; no other state is considered. My customer pointed this out to me immediately as unhelpful.

[import] --workspace: cannot import workspace without downloaded notebooks

Fails to import a fresh databricks-migrate export --profile sandbox-e2 --workspace:

databricks-migrate import --profile sandbox-e2  --workspace
Executing action: 'start importing databricks objects' at time: 2020-10-02 11:13:18.874860
Executing action: 'import all the workspace objects' at time: 2020-10-02 11:13:18.877754
Error: FileNotFoundError: [Errno 2] No such file or directory: 'logs/artifacts/Users'
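
A minimal sketch of a friendlier pre-check (an assumption; 'logs/artifacts' is the directory from the error above):

import os, sys

artifacts_dir = os.path.join('logs', 'artifacts')
if not os.path.isdir(artifacts_dir):
    sys.exit('No downloaded notebooks found at %s; run the export with '
             '--workspace --download first.' % artifacts_dir)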

[export] --metastore: No such file or directory: '/usr/local/lib/python3.8/site-packages/databricks_migrate/migrations/../data/aws_cluster.json'

Cannot export metastore:

databricks-migrate export --profile sandbox --metastore 
Executing action: 'start exporting databricks objects' at time: 2020-10-02 11:29:51.939548
Executing action: 'export metastore configs' at time: 2020-10-02 11:29:51.941965
Error: FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/lib/python3.8/site-packages/databricks_migrate/migrations/../data/aws_cluster.json'
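
One possible cause (an assumption) is that the bundled data/ templates are not shipped with the installed wheel, so the relative lookup fails after pip install. A minimal setup.py sketch of including them:

from setuptools import setup, find_packages

setup(
    name='databricks-migrate',
    packages=find_packages(),
    # ship the JSON cluster templates next to the code so the relative lookup works
    package_data={'databricks_migrate': ['data/*.json']},
    include_package_data=True,
)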

Support workspace-acls for single user export.

When a user executes a command like this:
python export_db.py --profile demo --azure --export-home [email protected] --workspace-acls

It throws the error below. When the --workspace-acls option is not used, no ACLs are exported.

Traceback (most recent call last):
  File "export_db.py", line 297, in <module>
    main()
  File "export_db.py", line 72, in main
    ws_c.log_all_workspace_acls()
  File "/databricks/driver/git1/migrate/build/lib/dbclient/WorkspaceClient.py", line 388, in log_all_workspace_acls
    self.log_acl_to_file('notebooks', workspace_log_file, 'acl_notebooks.log', 'failed_acl_notebooks.log')
  File "/databricks/driver/git1/migrate/build/lib/dbclient/WorkspaceClient.py", line 364, in log_acl_to_file
    with open(read_log_path, 'r') as read_fp, open(write_log_path, 'w') as write_fp, \
FileNotFoundError: [Errno 2] No such file or directory: 'azure_logs/user_workspace.log'

Tables are blank/empty after migration (external and DBFS)

failed_metastore log below:

{"resultType": "error", "summary": "java.lang.UnsupportedOperationException: Parquet does not support decimal. See HIVE-6384", "cause": "---------------------------------------------------------------------------\nPy4JJavaError Traceback (most recent call last)\n in \n----> 1 ddl_str = spark.sql("show create table default.campaign_master_table").collect()[0][0]\n\n/databricks/spark/python/pyspark/sql/session.py in sql(self, sqlQuery)\n 834 [Row(f1=1, f2=u'row1'), Row(f1=2, f2=u'row2'), Row(f1=3, f2=u'row3')]\n 835 """\n--> 836 return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)\n 837 \n 838 @SInCE(2.0)\n\n/databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in call(self, *args)\n 1255 answer = self.gateway_client.send_command(command)\n 1256 return_value = get_return_value(\n-> 1257 answer, self.gateway_client, self.target_id, self.name)\n 1258 \n 1259 for temp_arg in temp_args:\n\n/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)\n 61 def deco(*a, **kw):\n 62 try:\n---> 63 return f(*a, **kw)\n 64 except py4j.protocol.Py4JJavaError as e:\n 65 s = e.java_exception.toString()\n\n/databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)\n 326 raise Py4JJavaError(\n 327 "An error occurred while calling {0}{1}{2}.\n".\n--> 328 format(target_id, ".", name), value)\n 329 else:\n 330 raise Py4JError(\n\nPy4JJavaError: An error occurred while calling o218.sql.\n: java.lang.UnsupportedOperationException: Parquet does not support decimal. See HIVE-6384\n\tat org.apache.hadoop.hive.ql.io.parquet.serde.ArrayWritableObjectInspector.getObjectInspector(ArrayWritableObjectInspector.java:102)\n\tat org.apache.hadoop.hive.ql.io.parquet.serde.ArrayWritableObjectInspector.(ArrayWritableObjectInspector.java:60)\n\tat org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.initialize(ParquetHiveSerDe.java:113)\n\tat org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:339)\n\tat org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:288)\n\tat org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:194)\n\tat org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:1017)\n\tat org.apache.spark.sql.hive.client.HiveClientImpl.org$apache$spark$sql$hive$client$HiveClientImpl$$getRawTableOption(HiveClientImpl.scala:420)\n\tat org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$tableExists$1.apply$mcZ$sp(HiveClientImpl.scala:424)\n\tat org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$tableExists$1.apply(HiveClientImpl.scala:424)\n\tat org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$tableExists$1.apply(HiveClientImpl.scala:424)\n\tat org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:331)\n\tat org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$retryLocked$1.apply(HiveClientImpl.scala:239)\n\tat org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$retryLocked$1.apply(HiveClientImpl.scala:231)\n\tat org.apache.spark.sql.hive.client.HiveClientImpl.synchronizeOnObject(HiveClientImpl.scala:275)\n\tat org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:231)\n\tat org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:314)\n\tat org.apache.spark.sql.hive.client.HiveClientImpl.tableExists(HiveClientImpl.scala:423)\n\tat 
org.apache.spark.sql.hive.client.PoolingHiveClient$$anonfun$tableExists$1.apply(PoolingHiveClient.scala:275)\n\tat org.apache.spark.sql.hive.client.PoolingHiveClient$$anonfun$tableExists$1.apply(PoolingHiveClient.scala:274)\n\tat org.apache.spark.sql.hive.client.PoolingHiveClient.withHiveClient(PoolingHiveClient.scala:112)\n\tat org.apache.spark.sql.hive.client.PoolingHiveClient.tableExists(PoolingHiveClient.scala:274)\n\tat org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$tableExists$1.apply$mcZ$sp(HiveExternalCatalog.scala:903)\n\tat org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$tableExists$1.apply(HiveExternalCatalog.scala:903)\n\tat org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$tableExists$1.apply(HiveExternalCatalog.scala:903)\n\tat org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$withClient$1$$anonfun$apply$1.apply(HiveExternalCatalog.scala:144)\n\tat org.apache.spark.sql.hive.HiveExternalCatalog.org$apache$spark$sql$hive$HiveExternalCatalog$$maybeSynchronized(HiveExternalCatalog.scala:105)\n\tat org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$withClient$1.apply(HiveExternalCatalog.scala:142)\n\tat com.databricks.backend.daemon.driver.ProgressReporter$.withStatusCode(ProgressReporter.scala:372)\n\tat com.databricks.backend.daemon.driver.ProgressReporter$.withStatusCode(ProgressReporter.scala:358)\n\tat com.databricks.spark.util.SparkDatabricksProgressReporter$.withStatusCode(ProgressReporter.scala:34)\n\tat org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:140)\n\tat org.apache.spark.sql.hive.HiveExternalCatalog.tableExists(HiveExternalCatalog.scala:902)\n\tat org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.tableExists(ExternalCatalogWithListener.scala:147)\n\tat org.apache.spark.sql.catalyst.catalog.SessionCatalog.tableExists(SessionCatalog.scala:445)\n\tat org.apache.spark.sql.catalyst.catalog.SessionCatalog.requireTableExists(SessionCatalog.scala:205)\n\tat org.apache.spark.sql.catalyst.catalog.SessionCatalog.getTableMetadata(SessionCatalog.scala:458)\n\tat com.databricks.sql.DatabricksSessionCatalog.getTableMetadata(DatabricksSessionCatalog.scala:80)\n\tat org.apache.spark.sql.execution.command.ShowCreateTableCommand.run(tables.scala:1014)\n\tat org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)\n\tat org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)\n\tat org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)\n\tat org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:218)\n\tat org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:218)\n\tat org.apache.spark.sql.Dataset$$anonfun$54.apply(Dataset.scala:3501)\n\tat org.apache.spark.sql.Dataset$$anonfun$54.apply(Dataset.scala:3496)\n\tat org.apache.spark.sql.execution.SQLExecution$$anonfun$withCustomExecutionEnv$1$$anonfun$apply$1.apply(SQLExecution.scala:112)\n\tat org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:233)\n\tat org.apache.spark.sql.execution.SQLExecution$$anonfun$withCustomExecutionEnv$1.apply(SQLExecution.scala:98)\n\tat org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:835)\n\tat org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:74)\n\tat org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:185)\n\tat 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withAction(Dataset.scala:3496)\n\tat org.apache.spark.sql.Dataset.(Dataset.scala:218)\n\tat org.apache.spark.sql.Dataset$$anonfun$ofRows$2.apply(Dataset.scala:91)\n\tat org.apache.spark.sql.Dataset$$anonfun$ofRows$2.apply(Dataset.scala:88)\n\tat org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:835)\n\tat org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:88)\n\tat org.apache.spark.sql.SparkSession$$anonfun$sql$1.apply(SparkSession.scala:693)\n\tat org.apache.spark.sql.SparkSession$$anonfun$sql$1.apply(SparkSession.scala:688)\n\tat org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:835)\n\tat org.apache.spark.sql.SparkSession.sql(SparkSession.scala:688)\n\tat sun.reflect.GeneratedMethodAccessor232.invoke(Unknown Source)\n\tat sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat java.lang.reflect.Method.invoke(Method.java:498)\n\tat py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\tat py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)\n\tat py4j.Gateway.invoke(Gateway.java:295)\n\tat py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n\tat py4j.commands.CallCommand.execute(CallCommand.java:79)\n\tat py4j.GatewayConnection.run(GatewayConnection.java:251)\n\tat java.lang.Thread.run(Thread.java:748)", "table": "default.campaign_master_table"}

Parallelize API calls

Hi!

I used this tool recently to migrate some larger user homes with nearly a thousand notebooks. I noticed that the whole download/upload part is single-threaded, like the rest of the application. This makes the process very slow: it took me more than an hour to download a single user home, compared to a few (<6) minutes when I parallelized the download part. It was something like this:

from joblib import Parallel, delayed
...
def download_notebooks(self, ws_log_file='user_workspace.log', ws_dir='artifacts/'):
    ...
            responses = Parallel(n_jobs=16)(delayed(self.download_notebook_helper)(nd, self.get_export_dir() + ws_dir) for nd in fp)
            for dl_resp in responses:
                if 'error_code' not in dl_resp:
                    num_notebooks += 1

# param notebook_path changed to notebook_data
def download_notebook_helper(self, notebook_data, export_dir='artifacts/'):
    ...
    # moved from the original invocation in download_notebooks
    notebook_path = json.loads(notebook_data).get('path', None).rstrip('\n')
    ...

This could be done in multiple places where a lot of API requests are issued. Some of them are a bit harder to parallelize because they're recursive (like log_all_workspace_items), but I don't think that's impossible either.
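
For reference, a self-contained variant of the same idea using only the standard library (an assumption, not the tool's code; fetch_one stands in for any per-notebook API call):

from concurrent.futures import ThreadPoolExecutor

def fetch_one(item):
    # placeholder for a single API call, e.g. downloading one notebook
    return {'item': item}

def fetch_all(items, max_workers=16):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_one, items))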

I'm not a Python professional, nor someone who fully understands all parts of the tool, so I can't really open a meaningful PR for this, but I'd be very happy if such improvements could be made.

Thanks!

[export] --jobs: KeyError: 'jobs'

Cannot export jobs:

databricks-migrate export --profile sandbox --jobs     
Executing action: 'start exporting databricks objects' at time: 2020-10-02 11:27:08.517580
Executing action: 'export job configs' at time: 2020-10-02 11:27:08.520545
Error: KeyError: 'jobs'
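
A minimal sketch of a defensive fix (an assumption; client.get is a hypothetical helper): a workspace with no jobs can return a response without a 'jobs' key, so default to an empty list.

def list_jobs(client):
    resp = client.get('/jobs/list')
    return resp.get('jobs', [])    # empty workspace -> no 'jobs' key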

Skip Workspace ACLs for Missing Users

We should skip applying workspace ACLs for users that no longer exist.

  1. Log all the users / paths that are skipped
  2. Load the log as a lookup table
  3. Check lookup table before re-apply ACLs

We need to consider whether this is overall faster or slower than the brute-force approach of applying the ACLs and getting an error.

The job doesn't fail today; it just throws an error and keeps going. This would therefore be an enhancement for improved UX.
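
A minimal sketch of the lookup-table approach from the list above (an assumption about the ACL entry shape):

def filter_acls(acl_entries, missing_users):
    kept, skipped = [], []
    for entry in acl_entries:
        if entry.get('user_name') in missing_users:
            skipped.append(entry)    # log these for auditing
        else:
            kept.append(entry)
    return kept, skipped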

Group membership for users is not imported

Groups are created during the import of the ./groups folder (ScimClient.py - import_groups) by iterating over groups = self.listdir(group_dir), where "groups" is a generator object.
There is then a second iteration over the same generator, which should import group membership for the users. But the generator has already iterated over all the files in the ./groups folder, so no second iteration happens and no group membership is imported.
I just reinitialized the generator before the second iteration and it worked for me:
Screenshot of the change attached (2021-07-29).
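
A minimal, self-contained illustration of the fix (hypothetical names; the real change is in ScimClient.import_groups): a generator can only be consumed once, so materialize the listing before iterating twice.

import os

def listdir(path):
    for name in os.listdir(path):    # stand-in for self.listdir(group_dir)
        yield name

group_dir = './groups'               # assumed export layout
groups = list(listdir(group_dir))    # fix: materialize so it can be re-iterated
for name in groups:
    print('Creating group:', name)
for name in groups:                  # second pass now sees every file again
    print('Importing membership for:', name)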
