About prepare_data.sh (mlconvgec2018, closed)

nusnlp commented on August 19, 2024
About prepare_data.sh

from mlconvgec2018.

Comments (5)

shamilcm commented on August 19, 2024

I think the ratio used is 9, not 1.5. The script also removes sentences that are longer than 80 tokens or shorter than 1 token.

shamilcm commented on August 19, 2024

The parallel training data (NUCLE+Lang8) is cleaned (see https://github.com/nusnlp/mlconvgec2018/blob/master/data/prepare_data.sh#L94) so that only non-empty sentence pairs are retained.

awasthiabhijeet commented on August 19, 2024

In many <incorrect, correct> pairs of the Lang-8 data, the corrected side carries additional comments. If we feed these pairs to our models as is, I think they will not be very useful.
I guess the current data-cleaning scripts do not remove the additional comments that come with the corrected sentences.
Is there any script that handles this problem?

shamilcm commented on August 19, 2024

No, the current pre-processing pipeline does not include any specific rules to remove additional comments. However, the clean-corpus-n.perl script (from the Moses SMT toolkit) that is used within the preprocess.sh script removes source-target sentence pairs whose lengths differ substantially.
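
For readers who want to see what this kind of length-based cleaning looks like, here is a rough Python sketch. It is not the Moses script itself, only a simplified approximation of the behaviour described in this thread (a token range of 1 to 80 and a length ratio of 9). The names `keep_pair` and `clean_corpus` are made up for illustration, and tokens are assumed to be whitespace-separated.

```python
def keep_pair(source, target, min_len=1, max_len=80, max_ratio=9.0):
    """Return True if a sentence pair passes the length and ratio checks.

    Simplified approximation of length-based corpus cleaning; the default
    values are the ones mentioned in this thread, not read from any script.
    """
    src_len = len(source.split())
    tgt_len = len(target.split())

    # Drop pairs where either side is empty or longer than max_len tokens.
    if not (min_len <= src_len <= max_len and min_len <= tgt_len <= max_len):
        return False

    # Drop pairs whose lengths differ by more than max_ratio in either direction.
    return src_len <= max_ratio * tgt_len and tgt_len <= max_ratio * src_len


def clean_corpus(pairs, **kwargs):
    """Keep only the (source, target) pairs that pass keep_pair."""
    return [(s, t) for s, t in pairs if keep_pair(s, t, **kwargs)]
```

Note that a filter like this only looks at lengths, so a corrected sentence followed by a short comment would still pass.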

awasthiabhijeet commented on August 19, 2024

Thanks,
I see. Removing source-target pairs where len(target) > 1.5 * len(source) rejects around 30% of the data. :(
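
A quick way to check how much data a given threshold would reject might look like the sketch below. The function name `rejection_rate` and the two toy pairs are invented for illustration; they are not taken from Lang-8, and tokens are assumed to be whitespace-separated.

```python
def rejection_rate(pairs, max_ratio):
    """Fraction of pairs where len(target) > max_ratio * len(source), in tokens."""
    if not pairs:
        return 0.0
    rejected = sum(
        1 for source, target in pairs
        if len(target.split()) > max_ratio * len(source.split())
    )
    return rejected / len(pairs)


# Toy pairs (made up): the second target has a comment appended to the correction.
toy_pairs = [
    ("I goes home", "I go home"),
    ("He like it", "He likes it . Use the third person form here"),
]

for ratio in (1.5, 9.0):
    print(ratio, rejection_rate(toy_pairs, ratio))
```

On real data, comparing the rate at 1.5 with the rate at 9 would show how much of the rejection comes from the stricter threshold.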
