carted / processing-text-data

Presents an optimized Apache Beam pipeline for generating sentence embeddings (runnable on Cloud Dataflow).

Home Page: https://www.carted.com/blog/improving-dataflow-pipelines-for-text-data-processing/

License: Apache License 2.0

Languages: Python 100.00%

Topics: tensorflow, apache-beam, dataflow, text-data, bert, use-bert, tfhub

processing-text-data's Introduction

Processing text data at scale with Apache Beam and Cloud Dataflow

Presents an optimized Apache Beam pipeline for generating sentence embeddings (runnable on Cloud Dataflow). This repository accompanies our blog post: Improving Dataflow Pipelines for Text Data Processing.

To run the pipeline on Cloud Dataflow, you need a billing-enabled Google Cloud Platform (GCP) project.

Running the code locally

To run the code locally, first install the dependencies: pip install -r requirements. If you cannot create a Google Cloud Storage (GCS) Bucket, download the data from here instead. Only the train_data.txt file is needed. Note that without a GCS Bucket you cannot run the pipeline on Cloud Dataflow, which is the main objective of this repository.

After downloading the dataset, update the paths and command-line arguments in main.py that refer to GCS so that they point to your local copy.

Then execute python main.py -r DirectRunner.
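
The path change described above can be sketched as a small helper that falls back to a local file when no Bucket is available. This is only an illustration: the function name resolve_input_path and the local_dir parameter are hypothetical and do not come from main.py; the GCS layout (a data folder holding train_data.txt) follows the Dataflow steps in this README.

```python
from pathlib import Path
from typing import Optional

DATA_FILE = "train_data.txt"  # file name taken from this README


def resolve_input_path(gcs_bucket: Optional[str], local_dir: str = ".") -> str:
    """Return a GCS URI when a bucket name is given, otherwise a local path.

    Hypothetical helper: main.py may organize its paths differently.
    """
    if gcs_bucket:
        # Layout assumed from the README: gs://<BUCKET-NAME>/data/train_data.txt
        return f"gs://{gcs_bucket}/data/{DATA_FILE}"
    # No bucket: read the manually downloaded file from a local directory.
    return str(Path(local_dir) / DATA_FILE)


if __name__ == "__main__":
    print(resolve_input_path("my-bucket"))
    print(resolve_input_path(None, "downloads"))
```

With a helper like this, the same script could serve both the DirectRunner and the Dataflow case without editing paths by hand.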

Running the code on Cloud Dataflow

  1. Create a GCS Bucket and note its name.

  2. Then create a folder called data inside the Bucket.

  3. Copy over the train_data.txt file to the data folder: gsutil cp train_data.txt gs://<BUCKET-NAME>/data.

  4. Then run the following from the terminal:

    python main.py \
        --project <GCP-Project> \
        --gcs-bucket <BUCKET-NAME> \
        --runner DataflowRunner
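
For readers who want to see how such flags might be handled, here is a minimal argparse sketch. The flag names (--project, --gcs-bucket, -r/--runner) are the ones used in the commands in this README; the defaults, the help texts, and the validation rule are assumptions, and the actual parser in main.py may differ.

```python
import argparse


def parse_args(argv=None):
    """Parse the flags shown in the README commands (illustrative sketch)."""
    parser = argparse.ArgumentParser(
        description="Launch the sentence-embedding pipeline (sketch)."
    )
    parser.add_argument("--project", default=None, help="GCP project ID.")
    parser.add_argument("--gcs-bucket", default=None, help="GCS Bucket name.")
    parser.add_argument(
        "-r",
        "--runner",
        default="DirectRunner",
        choices=["DirectRunner", "DataflowRunner"],
        help="Apache Beam runner to use.",
    )
    args = parser.parse_args(argv)
    # Assumed rule: Dataflow needs both a project and a bucket.
    if args.runner == "DataflowRunner" and not (args.project and args.gcs_bucket):
        parser.error("DataflowRunner requires --project and --gcs-bucket")
    return args


if __name__ == "__main__":
    example = parse_args(
        ["--project", "my-project", "--gcs-bucket", "my-bucket", "--runner", "DataflowRunner"]
    )
    print(example.project, example.gcs_bucket, example.runner)
```

Failing early on a missing project or bucket avoids submitting a Dataflow job that cannot stage its files.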

For more details please refer to our blog post: Improving Dataflow Pipelines for Text Data Processing.

processing-text-data's People

Contributors

sayakpaul

processing-text-data's Issues

Spurious error when the pipeline is run on Dataflow

Updated code is here: https://github.com/carted/processing-text-data/tree/dev.

The following is the error Dataflow runs into:

INFO:apache_beam.runners.dataflow.dataflow_runner:2022-02-17T11:10:13.462Z: JOB_MESSAGE_ERROR: Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/dataflow_worker/batchworker.py", line 773, in run
    self._load_main_session(self.local_staging_directory)
  File "/usr/local/lib/python3.8/site-packages/dataflow_worker/batchworker.py", line 514, in _load_main_session
    pickler.load_session(session_file)
  File "/usr/local/lib/python3.8/site-packages/apache_beam/internal/pickler.py", line 311, in load_session
    return dill.load_session(file_path)
  File "/usr/local/lib/python3.8/site-packages/dill/_dill.py", line 368, in load_session
    module = unpickler.load()
  File "/usr/local/lib/python3.8/site-packages/dill/_dill.py", line 472, in load
    obj = StockUnpickler.load(self)
  File "/usr/local/lib/python3.8/site-packages/dill/_dill.py", line 462, in find_class
    return StockUnpickler.find_class(self, module, name)
ModuleNotFoundError: No module named 'configs'

@Nilabhra do you see anything immediately pesky in main.py?
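
A possible cause, offered as a guess rather than a confirmed diagnosis: the pickled main session refers to a local module (here configs) that is not installed on the Dataflow workers. A common remedy is to package the local modules with a setup.py and pass Beam's --setup_file option when launching the job. The sketch below only assembles the option strings; the helper name build_pipeline_args, the default region, and the temp folder are hypothetical and not taken from the repository.

```python
from typing import List


def build_pipeline_args(
    project: str,
    gcs_bucket: str,
    runner: str = "DataflowRunner",
    setup_file: str = "./setup.py",  # assumed location of a setup.py packaging configs
    region: str = "us-central1",  # assumed region; pick the one your project uses
) -> List[str]:
    """Build Beam pipeline flags, including --setup_file for Dataflow runs."""
    args = [f"--runner={runner}"]
    if runner == "DataflowRunner":
        args += [
            f"--project={project}",
            f"--region={region}",
            f"--temp_location=gs://{gcs_bucket}/temp",
            # Ships local packages to the workers so they can be imported there.
            f"--setup_file={setup_file}",
        ]
    return args


if __name__ == "__main__":
    for flag in build_pipeline_args("my-project", "my-bucket"):
        print(flag)
```

The resulting list can be handed to apache_beam.options.pipeline_options.PipelineOptions. Whether this resolves the error depends on how configs is laid out in the dev branch, which is not shown here.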
