- Get daily revenue by product, considering only COMPLETE and CLOSED orders.
- Data needs to be sorted in ascending order by date and then in descending order by the revenue computed for each product for each day.
- Data for orders and order_items is available in HDFS under /public/retail_db/orders and /public/retail_db/order_items
- Data for products is available in the local file system under /data/retail_db/products
- HDFS location – avro format /user/YOUR_USER_ID/daily_revenue_avro_python
- HDFS location – text format /user/YOUR_USER_ID/daily_revenue_txt_python
- Local location /home/YOUR_USER_ID/daily_revenue_python
- Solution needs to be stored under /home/YOUR_USER_ID/daily_revenue_python.txt
- Get daily product revenue
- Orders columns: order_id, order_date, order_customer_id, order_status
- Order_items columns: order_item_id, order_item_order_id, order_item_product_id, order_item_quantity, order_item_subtotal, order_item_product_price
- Data is comma separated
- Fetch data using spark.read.csv
- Apply type cast functions to convert fields to the appropriate types (see the sketch below)
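
A minimal PySpark sketch of one possible solution, assuming the retail_db column names listed above; note that the built-in `"avro"` format is available from Spark 2.4 onwards, while older versions need the com.databricks.spark.avro package instead:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("Daily Product Revenue").getOrCreate()

# Read comma-separated data and cast the fields we actually use
orders = spark.read.csv("/public/retail_db/orders") \
    .toDF("order_id", "order_date", "order_customer_id", "order_status") \
    .withColumn("order_id", col("order_id").cast("int"))

order_items = spark.read.csv("/public/retail_db/order_items") \
    .toDF("order_item_id", "order_item_order_id", "order_item_product_id",
          "order_item_quantity", "order_item_subtotal", "order_item_product_price") \
    .withColumn("order_item_order_id", col("order_item_order_id").cast("int")) \
    .withColumn("order_item_subtotal", col("order_item_subtotal").cast("float"))

# Revenue per date and product for COMPLETE and CLOSED orders,
# sorted by date ascending and revenue descending
daily_revenue = orders \
    .filter(col("order_status").isin("COMPLETE", "CLOSED")) \
    .join(order_items, col("order_id") == col("order_item_order_id")) \
    .groupBy("order_date", "order_item_product_id") \
    .sum("order_item_subtotal") \
    .withColumnRenamed("sum(order_item_subtotal)", "revenue") \
    .orderBy(col("order_date"), col("revenue").desc())

# Avro output (use format("com.databricks.spark.avro") on Spark < 2.4)
daily_revenue.write.format("avro") \
    .save("/user/YOUR_USER_ID/daily_revenue_avro_python")

# Text output: collapse to a single comma-delimited string column
daily_revenue \
    .selectExpr("concat(order_date, ',', order_item_product_id, ',', revenue)") \
    .write.text("/user/YOUR_USER_ID/daily_revenue_txt_python")
```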
- Details - Duration 40 minutes
- Data set URL
- Choose language of your choice Python or Scala
- Data is available in HDFS file system under /public/crime/csv. You can check properties of files using hadoop fs -ls -h /public/crime/csv
- Structure of data (ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location)
- File format - text file
- Delimiter - “,”
- Get monthly count of primary crime type, sorted by month in ascending order and by number of crimes per type in descending order
- Store the result in HDFS path /user/<YOUR_USER_ID>/solutions/solution01/crimes_by_type_by_month
- Output File Format: TEXT
- Output Columns: Month in YYYYMM format, crime count, crime type
- Output Delimiter: \t (tab delimited)
- Output Compression: gzip
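
A possible PySpark sketch; it assumes the CSV files carry a header row and that the Date field looks like MM/dd/yyyy hh:mm:ss a, so characters 7-10 give the year and characters 1-2 the month (both are assumptions about the dataset):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, substring, concat, concat_ws, count, lit

spark = SparkSession.builder.appName("Crimes by Type by Month").getOrCreate()

crimes = spark.read.option("header", "true").csv("/public/crime/csv")

# Month in YYYYMM format: year from chars 7-10, month from chars 1-2 of Date
crimes_by_month = crimes \
    .withColumn("month", concat(substring(col("Date"), 7, 4),
                                substring(col("Date"), 1, 2)).cast("int")) \
    .groupBy(col("month"), col("Primary Type").alias("crime_type")) \
    .agg(count(lit(1)).alias("crime_count")) \
    .orderBy(col("month"), col("crime_count").desc())

# Tab-delimited text, gzip-compressed
crimes_by_month \
    .select(concat_ws("\t", col("month"), col("crime_count"), col("crime_type"))) \
    .write.option("compression", "gzip") \
    .text("/user/YOUR_USER_ID/solutions/solution01/crimes_by_type_by_month")
```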
- Source directories: /data/retail_db/orders and /data/retail_db/customers
- Source delimiter: comma (“,”)
- Source Columns - orders - order_id, order_date, order_customer_id, order_status
- Source Columns - customers - customer_id, customer_fname, customer_lname and many more
- Get the customers who have not placed any orders, sorted by customer_lname and then customer_fname
- Target Columns: customer_lname, customer_fname
- Target Directory: /user/<YOUR_USER_ID>/solutions/solution02/inactive_customers
- Target File Format: TEXT
- Target Delimiter: comma (“,”)
- Compression: N/A
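
A left outer join followed by a null filter is one way to find customers without orders. A sketch, assuming the first three customer columns are customer_id, customer_fname and customer_lname:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("Inactive Customers").getOrCreate()

orders = spark.read.csv("/data/retail_db/orders") \
    .toDF("order_id", "order_date", "order_customer_id", "order_status")

# Only the first three customer columns are needed; the rest keep default names
customers = spark.read.csv("/data/retail_db/customers") \
    .withColumnRenamed("_c0", "customer_id") \
    .withColumnRenamed("_c1", "customer_fname") \
    .withColumnRenamed("_c2", "customer_lname")

# Customers with no matching order: left outer join, keep rows where the
# order side is null
inactive = customers \
    .join(orders, customers.customer_id == orders.order_customer_id, "left_outer") \
    .filter(col("order_id").isNull()) \
    .orderBy("customer_lname", "customer_fname") \
    .selectExpr("concat(customer_lname, ',', customer_fname)")

inactive.write.text("/user/YOUR_USER_ID/solutions/solution02/inactive_customers")
```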
- Data is available in HDFS file system under /public/crime/csv
- Structure of data (ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location)
- File format - text file
- Delimiter - “,” (use a regex while splitting, e.g. split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", -1), as some fields contain commas and are enclosed in double quotes)
- Get top 3 crime types based on number of incidents in RESIDENCE area using “Location Description”
- Store the result in HDFS path /user/<YOUR_USER_ID>/solutions/solution03/RESIDENCE_AREA_CRIMINAL_TYPE_DATA
- Output Fields: Crime Type, Number of Incidents
- Output File Format: JSON
- Output Delimiter: N/A
- Output Compression: No
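
A sketch using the RDD API, where the quote-aware split regex from above can be applied per line with Python's re module; the 0-based field indexes 5 (Primary Type) and 7 (Location Description) follow from the structure listed above:

```python
import re
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Residence Crime Types").getOrCreate()

# Split on commas that sit outside double-quoted fields
pattern = ',(?=(?:[^"]*"[^"]*")*[^"]*$)'

crimes = spark.sparkContext.textFile("/public/crime/csv")
header = crimes.first()

top3 = crimes \
    .filter(lambda line: line != header) \
    .map(lambda line: re.split(pattern, line)) \
    .filter(lambda rec: rec[7] == "RESIDENCE") \
    .map(lambda rec: (rec[5], 1)) \
    .reduceByKey(lambda a, b: a + b) \
    .sortBy(lambda kv: kv[1], ascending=False) \
    .take(3)

# JSON output with the required field names
spark.createDataFrame(top3, ["Crime Type", "Number of Incidents"]) \
    .write.json("/user/YOUR_USER_ID/solutions/solution03/RESIDENCE_AREA_CRIMINAL_TYPE_DATA")
```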
- Data is available in HDFS under /public/randomtextwriter
- Get word count for the input data using space as delimiter (for each word, we need to get how many times it is repeated in the entire input data set)
- Number of executors should be 10
- Executor memory should be 3 GB
- Executor cores should be 20 in total (2 per executor)
- Number of output files should be 8
- Avro dependency details: groupId -> com.databricks, artifactId -> spark-avro_2.10, version -> 2.0.1
- Target Directory: /user/<YOUR_USER_ID>/solutions/solution05/wordcount
- Target File Format: Avro
- Target fields: word, count
- Compression: N/A or default
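
The executor settings are spark-submit options rather than code. A possible launch command (the script name wordcount.py is hypothetical, and the spark-avro artifact's Scala version must match your Spark build):

```bash
spark-submit --master yarn \
  --num-executors 10 \
  --executor-memory 3G \
  --executor-cores 2 \
  --packages com.databricks:spark-avro_2.10:2.0.1 \
  wordcount.py
```

and a minimal word count sketch:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Word Count").getOrCreate()

# Classic word count: split on space, emit (word, 1), sum per word;
# 8 reduce partitions give 8 output files
word_count = spark.sparkContext.textFile("/public/randomtextwriter") \
    .flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b, 8)

spark.createDataFrame(word_count, ["word", "count"]) \
    .write.format("com.databricks.spark.avro") \
    .save("/user/YOUR_USER_ID/solutions/solution05/wordcount")
```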