This repo contains Spark code and HiveQL (HQL) scripts for data engineering practice.
- Hadoop Cluster https://www.databricks.com/glossary/hadoop-cluster#:~:text=Hadoop%20clusters%20are%20composed%20of,running%20on%20a%20separate%20machine.
- Hive https://aws.amazon.com/big-data/what-is-hive/
- HDFS (Hadoop Distributed File System) https://www.databricks.com/glossary/hadoop-distributed-file-system-hdfs#:~:text=HDFS%20(Hadoop%20Distributed%20File%20System,handle%20and%20store%20big%20data.
- Generate a key pair:
  - For Linux/Mac: Run the following command in the terminal:
    `ssh-keygen -t rsa -b 4096 -C "your_email@example.com"`
    If you accept the default location when prompted, this command creates the private key at ~/.ssh/id_rsa and the public key at ~/.ssh/id_rsa.pub.
  - For Windows: Use a tool like PuTTYgen to generate a key pair. Save the public and private keys (public_key.pem and private_key.ppk) to a safe location.
- On the remote server, go to the ~/.ssh folder, create an authorized_keys file if it does not already exist, and paste your public key into it.
- Ensure the proper permissions are set on your private key file:
  - For Linux/Mac:
    `chmod 600 ~/.ssh/id_rsa`
  - For Windows: No chmod step is needed after you save the private key (private_key.ppk) in PuTTYgen. Keep the file in a folder that only your account can access.
- Connect to the edge node using your private key:
  - For Linux/Mac: Run the following command in the terminal, replacing `user` with your username and `edge_node_ip` with the IP address or hostname of the edge node:
    `ssh -i ~/.ssh/id_rsa user@edge_node_ip`
  - For Windows: Open PuTTY and enter the edge node's IP address or hostname in the "Host Name (or IP address)" field. In the left pane, navigate to Connection > SSH > Auth (in recent PuTTY versions the key field is under Auth > Credentials), and click the "Browse" button to select your private_key.ppk file. Click "Open" to initiate the connection.
- If prompted, accept the edge node's host key by typing "yes" (Linux/Mac) or clicking "Yes" (Windows). You should now be connected to the edge node.
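
The server-side step above (creating the authorized keys file and pasting in the public key) can also be scripted. The following is a minimal sketch, not part of this repo: `install_public_key` is a hypothetical helper name, and it assumes the public key file has already been copied to the remote server. It appends the key once and applies the restrictive permissions that sshd usually expects on the .ssh directory and the authorized_keys file.

```shell
# Hypothetical helper (an illustration, not part of this repo).
# Usage: install_public_key <public_key_file> [ssh_dir]
# ssh_dir defaults to ~/.ssh; pass another directory to try it out safely.
install_public_key() {
    local pub_key_file="$1"
    local ssh_dir="${2:-$HOME/.ssh}"
    local auth_file="$ssh_dir/authorized_keys"
    local key

    if [ ! -f "$pub_key_file" ]; then
        echo "public key not found: $pub_key_file" >&2
        return 1
    fi

    # Create the directory and file if needed, readable by the owner only.
    mkdir -p "$ssh_dir"
    chmod 700 "$ssh_dir"
    touch "$auth_file"
    chmod 600 "$auth_file"

    # Append the key only if that exact line is not already present.
    key="$(cat "$pub_key_file")"
    if ! grep -qxF -- "$key" "$auth_file"; then
        printf '%s\n' "$key" >> "$auth_file"
    fi
}
```

On the remote server this could be called as `install_public_key ~/id_rsa.pub`, assuming the public key was copied to the home directory first. On Linux/Mac clients, OpenSSH's `ssh-copy-id -i ~/.ssh/id_rsa.pub user@edge_node_ip` performs a similar step from the local machine, provided password login to the edge node is still enabled.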