Giter Site home page Giter Site logo

scrappy's Introduction

How to automatically assigns tags to these events.

In order to classify each and every scraped event to the respective tag it belongs, lets throw some light on what we are trying to achive here. Lets say I am avid puzzle solver and jigsaw puzzles are something what i love. I have collected thousands of puzzle sets since my childhood, and over the time things are cluster F**ng mess and jigsaw pizzes are scatterd all across my home. Now I intend to keep seprate boxes for all the 1000 puzzle sets each containing jigsaw beices belonging to that puzzle. So, I pick up a piece of puzzle and I have to take decision right then which box is most likely to contain this piece. I will start with analyzing the puzzle with features such as color, size of peice , etcs which help me identify and build a mental model which puzzle is this peice most likely to fit in. Lets say puzzle piece I picked is blue in colors and there is pirate flag visible on that piece , so I can catagorized this puzzle piece as a part of Sea Voyage Puzzle. My likely hood of classifying each piece will increase with the ammount of experience I had with the puzzles.

Since the goal in information retrieval is to find the best document given a query, one could decide to model the probability of a document given particular query. Hence, in the puzzle example mentioned above , this would mean we are trying to do direct mapping from appearnce puzzle peiece to a box is needed. We need to calculate of likelyhood of the box given a piece or in this case 'extracted_event' to tag.

example P(Sports| cheered) ~ higer probability P(Politics| cheered) ~ not very high likelyhood

Hence for each and every scraped event , we have to do NER on that event to find even referencing phrases

Event Referring Phrases: • Useful to display in connection with events

  • E.g. “iPhone” + “launched” + “October 6” • Helpful in categorizing Events into Types

Examples: Apple to Announce iPhone 5 on October 4th! YES! iPhone 5 announcement coming Oct 4th WOOOHOO NEW IPHONE TODAY! CAN’T WAIT!

Categorizing Event Types

• Would like to categorize events into types, for example:

  • Sports
  • Politics
  • Music

Classifying Events Challenges :

• Many Different Types • Not sure what is the right set of types • Set of types might change – Might start talking about different things – Might want to focus on different groups of users

Annotating Named Entities

tweeted_on : [20th November 2016] date [Kanye West]p ruined [MTV Europe Music Awards]event [2016]occurence date In annotating NER , we have to do this either manualy or for specific domain which enables the System to have a training set to refer to, when scraping new events. But there is lot of manual work in this approach.

Unsupervised Event Type Induction

• Latent Variable Models

  • Generative Probabilistic Models ( the one I talked about at the begining)

• Advantages:

  • Discovers types which match the data
  • No need to annotate individual events
  • Don’t need to commit to a specific set of types
  • Modular, can integrate into various applications

However Each Event Phrase is modeled as a mixture of types, here we do probilistic determination of each named entity, to identify mapping of that entity to specific tag P(SPORTS|cheered)= 0.6 P(POLITICS|cheered)= 0.4

Each Event Type phrase is associated with modeled as a distribution mixture of other Entities, Types and Dates

I have added a python script, which I have written earlier to do NER on Tweets https://github.com/aregee/scrappy/blob/master/ner.py

The python code helps to classify tweets in the following categories: ['LOCATION', 'PERSON', 'FACILITY', 'GSP', 'ORGANIZATION', 'GPE'] Below is data produced by the script, on the recent tweets near my geo location

{
  "data": {
    "hashtags": [
      [
        "#Property",
        3
      ],
      [
        "#india",
        3
      ],
      [
        "#Noida",
        3
      ],
      [
        "#Flat",
        3
      ],
      [
        "#NewDelhi",
        2
      ],
      [
        "#Residential",
        2
      ],
      [
        "#Gurgaon",
        2
      ],
      [
        "#ForRent",
        2
      ],
      [
        "#WowMSG2MSG2",
        2
      ],
      [
        "#Resi",
        1
      ],
      [
        "#airlifton22ndjan",
        1
      ],
      [
        "#SaritaVihar,",
        1
      ],
      [
        "#CareerArc",
        1
      ],
      [
        "#photomariefrayssinet",
        1
      ],
      [
        "#Apartment",
        1
      ],
      [
        "#city",
        1
      ],
      [
        "#Rajasthan",
        1
      ],
      [
        "#morning",
        1
      ],
      [
        "#kitavegetariankakak",
        1
      ],
      [
        "#AZADI",
        1
      ],
      [
        "#bikaner",
        1
      ],
      [
        "#Jaipur",
        1
      ],
      [
        "#goodearth",
        1
      ],
      [
        "#JSA",
        1
      ],
      [
        "#Sector162",
        1
      ],
      [
        "#Jobs",
        1
      ],
      [
        "#Lonelyplanet",
        1
      ],
      [
        "#Land",
        1
      ],
      [
        "#Commercial",
        1
      ],
      [
        "#amberfort",
        1
      ],
      [
        "#Sector45",
        1
      ],
      [
        "#ResidentialPlot",
        1
      ],
      [
        "#Neu-Delhi",
        1
      ],
      [
        "#be",
        1
      ],
      [
        "#job?",
        1
      ],
      [
        "#incredibleindia",
        1
      ],
      [
        "#Indepedence#IndepedenceOfMind/Soul",
        1
      ],
      [
        "#adayintheworld",
        1
      ],
      [
        "#goodmorning",
        1
      ],
      [
        "#Respect",
        1
      ],
      [
        "#ForSale",
        1
      ],
      [
        "#Quran",
        1
      ],
      [
        "#Ghaziabad",
        1
      ],
      [
        "#dessert",
        1
      ],
      [
        "#2BHK",
        1
      ],
      [
        "#VaibhavKhand",
        1
      ],
      [
        "#snapdeal",
        1
      ],
      [
        "#orange\u2026",
        1
      ],
      [
        "#Air",
        1
      ],
      [
        "#Shop",
        1
      ],
      [
        "#quiz",
        1
      ],
      [
        "#FreeChargeInEverySale",
        1
      ],
      [
        "#4BHK",
        1
      ],
      [
        "#Job:",
        1
      ],
      [
        "#ForSale.",
        1
      ],
      [
        "#rajasthan",
        1
      ],
      [
        "#lunchtime",
        1
      ],
      [
        "#pink",
        1
      ],
      [
        "#Hiring",
        1
      ],
      [
        "#millionshadesofindia\u2026",
        1
      ]
    ],
    "links": [
      [
        "https://t.co/MQRmfcar0W",
        1
      ],
      [
        "https://t.co/W1UZZI8EXr",
        1
      ],
      [
        "https://t.co/HUzjAsfMo7",
        1
      ],
      [
        "https://t.co/aDjRdVFdj6",
        1
      ],
      [
        "https://t.co/JN4xZwbPj7",
        1
      ],
      [
        "https://t.co/sj4W08NwVU",
        1
      ],
      [
        "https://t.co/EDe2zMqdhV",
        1
      ],
      [
        "https://t.co/KhXQ93x63L",
        1
      ],
      [
        "https://t.co/IaqNfcOdAH",
        1
      ],
      [
        "https://t.co/rTXxP6x9NR",
        1
      ],
      [
        "https://t.co/ipKGMlkY0u",
        1
      ],
      [
        "https://t.co/6UeA17GkeY",
        1
      ],
      [
        "https://t.co/aRLHA9jHVt",
        1
      ],
      [
        "https://t.co/cSvbATv9d3",
        1
      ],
      [
        "https://t.co/2oIxzUlysa",
        1
      ],
      [
        "https://t.co/KaCUrhvdVp",
        1
      ],
      [
        "https://t.co/YRh80WARgo",
        1
      ],
      [
        "https://t.co/VjAwVi15N5",
        1
      ]
    ],
    "mentions": [
      [
        "@pankajTHEbhatia",
        3
      ],
      [
        "@Flw_ur_dreamz",
        2
      ],
      [
        "@hr_shapers",
        1
      ],
      [
        "@StarGoldIndia",
        1
      ],
      [
        "@1Patelzuber",
        1
      ],
      [
        "@cool_bulls",
        1
      ],
      [
        "@Shakir248",
        1
      ],
      [
        "@poonamraniratia",
        1
      ],
      [
        "@PanwarDimple",
        1
      ],
      [
        "@AanchalTripathi",
        1
      ],
      [
        "@IRCTC_Ltd",
        1
      ],
      [
        "@insanfamily1",
        1
      ],
      [
        "@RaviAwasthi",
        1
      ]
    ],
    "ner": {
      "GPE": {
        "Delhi": {
          "count": 1,
          "tweet_ids": [
            690084388172926976
          ]
        },
        "Good": {
          "count": 1,
          "tweet_ids": [
            690085581234249729
          ]
        },
        "INDIA": {
          "count": 1,
          "tweet_ids": [
            690088905757036544
          ]
        },
        "Janam": {
          "count": 1,
          "tweet_ids": [
            690085744262680576
          ]
        },
        "Modi": {
          "count": 1,
          "tweet_ids": [
            690088998023368706
          ]
        },
        "Oglio": {
          "count": 1,
          "tweet_ids": [
            690086707597832192
          ]
        },
        "Palace": {
          "count": 1,
          "tweet_ids": [
            690086219707977728
          ]
        },
        "Wo": {
          "count": 1,
          "tweet_ids": [
            690083923683049472
          ]
        }
      },
      "ORGANIZATION": {
        "Chocolate": {
          "count": 1,
          "tweet_ids": [
            690085201024847873
          ]
        },
        "Comfort My Soul": {
          "count": 1,
          "tweet_ids": [
            690086727701155841
          ]
        },
        "DDA Flats": {
          "count": 1,
          "tweet_ids": [
            690085517292146688
          ]
        },
        "DayOneIsAllowed2sayDifferently2Act": {
          "count": 1,
          "tweet_ids": [
            690087769331388416
          ]
        },
        "Employees": {
          "count": 1,
          "tweet_ids": [
            690086949059706881
          ]
        },
        "Fiserv": {
          "count": 1,
          "tweet_ids": [
            690083002785857536
          ]
        },
        "GPTW": {
          "count": 1,
          "tweet_ids": [
            690086949059706881
          ]
        },
        "Health": {
          "count": 1,
          "tweet_ids": [
            690086949059706881
          ]
        },
        "IndepedenceOnEarthWillBeThe": {
          "count": 1,
          "tweet_ids": [
            690087769331388416
          ]
        },
        "Parsvnath Majestic Arcade": {
          "count": 1,
          "tweet_ids": [
            690085272562831362
          ]
        },
        "SQFT Center Park": {
          "count": 1,
          "tweet_ids": [
            690087456339841024
          ]
        },
        "Shalimar Garden": {
          "count": 1,
          "tweet_ids": [
            690086707597832192
          ]
        },
        "WAHEGURU Give Me Hope": {
          "count": 1,
          "tweet_ids": [
            690086727701155841
          ]
        },
        "WOMEN": {
          "count": 1,
          "tweet_ids": [
            690082785944604672
          ]
        }
      },
      "PERSON": {
        "Ahhhh": {
          "count": 1,
          "tweet_ids": [
            690084388172926976
          ]
        },
        "Bikaner": {
          "count": 1,
          "tweet_ids": [
            690088905757036544
          ]
        },
        "Free": {
          "count": 1,
          "tweet_ids": [
            690086949059706881
          ]
        },
        "Network Engineer": {
          "count": 1,
          "tweet_ids": [
            690085601341784068
          ]
        },
        "None": {
          "count": 1,
          "tweet_ids": [
            690086727701155841
          ]
        },
        "Prateek": {
          "count": 1,
          "tweet_ids": [
            690087456339841024
          ]
        },
        "Prem": {
          "count": 1,
          "tweet_ids": [
            690088953186275329
          ]
        },
        "Ready": {
          "count": 1,
          "tweet_ids": [
            690087456339841024
          ]
        },
        "Shahpurjat": {
          "count": 1,
          "tweet_ids": [
            690084388172926976
          ]
        },
        "Technical": {
          "count": 1,
          "tweet_ids": [
            690083002785857536
          ]
        },
        "Truffle Cake": {
          "count": 1,
          "tweet_ids": [
            690085201024847873
          ]
        },
        "Twitter": {
          "count": 2,
          "tweet_ids": [
            690082451155189761,
            690082234393559041
          ]
        }
      }
    }
  }
}

scrappy's People

Contributors

aregee avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.