
Overview of the solution by i-Team A: Disinformation Analyzer

Problem

Josep Borrell, High Representative/Vice-President, aptly described the problem as follows: “We have to focus on foreign actors who intentionally, in a coordinated manner, try to manipulate our information environment. We need to work with democratic partners around the world to fight information manipulation by authoritarian regimes more actively. It is time to roll up our sleeves and defend democracy, both at home and around the world.” (source: FIMI). For this challenge, the focus is therefore on estimating the credibility of a news article or social media message, i.e. the likelihood that it represents true information rather than disinformation.

Methodology

The methodology is based on an understanding of the anatomy of disinformation and misinformation.

Types of Misinformation and Disinformation

  • Fabricated Content: false content;
  • Manipulated Content: Genuine information or imagery that has been distorted, e.g. a sensational headline or populist ‘click bait’;
  • Imposter Content: Impersonation of genuine sources, e.g. using the branding of an established agency;
  • Misleading Content: Misleading information, e.g. comment presented as fact;
  • False Context: Factually accurate content combined with false contextual information, e.g. when the headline of an article does not reflect the content;
  • Satire and Parody: Humorous but false stories passed off as true. There is no intention to harm but readers may be fooled;
  • False Connections: When headlines, visuals or captions do not support the content;
  • Sponsored Content: Advertising or PR disguised as editorial content;
  • Propaganda: Content used to manage attitudes, values and knowledge;
  • Error: A mistake made by established news agencies in their reporting.

In addition to new and more sophisticated ways of manipulating content, there are also a growing number of ways in which social media can be used to manipulate conversations:

  • A Sockpuppet is an online identity used to deceive. The term now extends to misleading uses of online identities to praise, defend, or support a person or organization; to manipulate public opinion; or to circumvent restrictions, suspension or an outright ban from a website. The difference between a pseudonym and a sockpuppet is that the sockpuppet poses as an independent third party, unaffiliated with the main account holder. Sockpuppets are unwelcome in many online communities and forums;
  • Sealioning is a type of trolling or harassment where people are pursued with persistent requests for evidence or repeated questions. A pretence of civility and sincerity is maintained with these incessant, bad-faith invitations to debate;
  • Astroturfing masks the sponsors of a message (e.g. political, religious, advertising or PR organizations) to make it appear as though it comes from grassroots participants. The practice aims to give organizations credibility by withholding information about their motives or connections;
  • Catfishing is a form of fraud where a person creates a sockpuppet or fake identity to target a particular victim on social media. It is common in romance scams on dating websites. It may be done for financial gain, to compromise a victim, or as a form of trolling or wish fulfilment.

By understanding the anatomy of fake news, people can detect whether a news item is true or false. To do so, they are advised to ask ten questions:

10 questions to ask

The approach can therefore be split into four stages, each of which is discussed in more detail below:

  1. Collect relevant content.
  2. Enrich that content using ML.
  3. Add vector embeddings to the content for semantic analysis and store the content.
  4. Analyse the content, automatically and/or manually.

Next, the implementation is discussed: it is based on the open source event streaming platform Apache Kafka, which connects the many micro-services that enrich the content. In the context of this hackathon, only part of the solution is shared; the full solution is, in principle, available to NATO-affiliated government organisations, so please reach out to us to discuss it.

Subsequently, the results are explained. Based on the data in this repository, a user can play with the provided content in the database, either by querying it using GraphQL or by running a Jupyter notebook. The full solution offers a more powerful GUI, which is not part of the released software (see the comment above).

Collecting content

Simplified pipeline

Analysts are in the best position to determine which information sources contain the most relevant content: RSS feeds, websites, Telegram channels, Twitter hashtags, etc. They can specify these sources in the GUI (not included), including their refresh rate. Alternatively, they can upload their own URLs manually or via a script.
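
As an illustration, a source-configuration message could be published to Kafka roughly as follows. This is a minimal sketch: the topic name, field names, and values are hypothetical and do not reflect the framework's actual configuration schema.

```python
import json
from kafka import KafkaProducer

# Hypothetical source configuration: the topic name ("source-config") and
# the field names are illustrative assumptions, not the framework's schema.
source_config = {
    "type": "rss",                 # e.g. rss | website | telegram | twitter
    "url": "https://example.org/news/rss.xml",
    "refreshRateMinutes": 15,      # how often the crawler re-checks the feed
    "language": "en",
}

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("source-config", source_config)
producer.flush()
```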

The provided dataset contains data from the supplied CSVs, but also from TASS, EMM (Europe Media Monitor), Google News, the New York Times, and several other sources.

When the relevant channels have been specified, the configuration is published to Kafka, and the crawlers and scrapers start to collect content. In the case of RSS feeds, the RSS crawlers first analyse the feed for new content and subsequently publish the new article links to Kafka. The complete framework contains many scrapers, e.g. for generic websites, dedicated websites, Telegram, and Twitter. The Twitter service, which was developed during the hackathon, is available in the twitter-service folder and should illustrate how easy it is to add a new service. Discovered content, be it text or images, is published by the scrapers to Kafka as well, so it can be processed in the next stage.
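
A minimal sketch of such an RSS crawler is shown below, assuming the feedparser and kafka-python libraries; the topic name "article-links" and the message fields are assumptions, not the framework's actual contract.

```python
import json
import feedparser
from kafka import KafkaProducer

# Minimal RSS crawler sketch: check a feed for new entries and publish the
# article links to Kafka so that a downstream scraper can fetch the full text.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

seen_links = set()  # in a real service, this state would be persisted between runs


def crawl(feed_url: str) -> None:
    feed = feedparser.parse(feed_url)
    for entry in feed.entries:
        if entry.link not in seen_links:
            seen_links.add(entry.link)
            producer.send("article-links", {
                "url": entry.link,
                "title": entry.get("title", ""),
                "published": entry.get("published", ""),
                "source": feed_url,
            })
    producer.flush()


crawl("https://example.org/news/rss.xml")
```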

Processing content

Simplified sequence diagram of the pipeline

When the article content is available, many NLP microservices start to work in parallel to enrich the retrieved articles. To name a few:

  • Language detection & translation using Franc and LibreTranslate
  • Summarizing
  • Named Entity Recognition (+keywords)
  • Geo-tagging
  • Face detection
  • Sentiment & Emotion score
  • Readability score
  • Sarcasm/joke score
  • Topic detection: Louvain algorithm
  • Channel affiliation & credibility
  • Semantic word embeddings in Weaviate: semitechnologies/transformers-inference:sentence-transformers-paraphrase-multilingual-mpnet-base-v2
  • Semantic image embeddings in Weaviate: semitechnologies/img2vec-pytorch:resnet50

During this hackathon, we developed the emotion and readability score microservices, which are available in this repository as well.
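
As an illustration of what such an enrichment microservice looks like, the sketch below consumes scraped articles from Kafka, computes a readability score, and publishes the result back. The topic names, field names, and the use of the textstat library are assumptions; the actual services in this repository may be structured differently.

```python
import json
import textstat
from kafka import KafkaConsumer, KafkaProducer

# Sketch of an enrichment microservice: consume scraped articles, compute a
# readability score (Flesch reading ease, English-oriented), publish the result.
# Topic names ("articles", "article-readability") and fields are assumptions.
consumer = KafkaConsumer(
    "articles",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
    group_id="readability-service",
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for message in consumer:
    article = message.value
    score = textstat.flesch_reading_ease(article.get("content", ""))
    producer.send("article-readability", {
        "articleId": article.get("id"),
        "readability": score,
    })
```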

The outcome of each microservice is, again, published to Kafka, and aggregated in the next stage.

Storing and semantically embedding content

When the hard work is done and the articles and tweets have been analysed in detail, the results are uploaded to the database. The selected database is Weaviate, a so-called vector database: it not only stores your data, as many other databases do, but also first computes a word embedding using a multilingual BERT-based NLP transformer service. This is done for the article as a whole, as well as for each paragraph, which enables semantic search and Question & Answering.
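
The sketch below shows how an article class could be defined so that every imported object is vectorized by the transformers module, using the Weaviate Python client. The class and property names are assumptions for illustration; the actual schema ships with the infrastructure folder.

```python
import weaviate

# Sketch: define an "Article" class that is vectorized by the transformers
# module on import. Class and property names are illustrative assumptions.
client = weaviate.Client("http://localhost:8080")

article_class = {
    "class": "Article",
    "vectorizer": "text2vec-transformers",
    "properties": [
        {"name": "title", "dataType": ["text"]},
        {"name": "content", "dataType": ["text"]},
        {"name": "url", "dataType": ["text"]},
    ],
}
client.schema.create_class(article_class)

# Adding an object triggers vectorization of its text properties.
client.data_object.create(
    data_object={
        "title": "Example article",
        "content": "Example article text",
        "url": "https://example.org/article",
    },
    class_name="Article",
)
```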

In the infrastructure folder, you will find a complete setup to test: it contains the Weaviate database service, the transformers for text and images, and a Jupyter notebook service, so you can play with it yourself. Alternatively, you can query the database directly using GraphQL.

Analysing content

In the final stage, the enriched content is presented to the analyst. A disinformation score is computed based on the computed NLP attributes and the relevance of an article's content with respect to the current narrative. The analyst can query the content using GraphQL or Jupyter notebooks. See the examples in the infrastructure folder to get everything running locally, and in the Jupyter-Weaviate-interface folder to build the Jupyter notebook.
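
For example, a semantic search over the stored articles could look as follows, using the Weaviate Python client (which wraps the underlying GraphQL Get query). The class name "Article" and its properties are assumptions about the schema used in this repository.

```python
import weaviate

# Semantic (nearText) search over stored articles; class/property names are
# assumptions about the schema provided in the infrastructure folder.
client = weaviate.Client("http://localhost:8080")

result = (
    client.query
    .get("Article", ["title", "summary", "url"])
    .with_near_text({"concepts": ["energy supply disinformation"]})
    .with_limit(5)
    .do()
)

for article in result["data"]["Get"]["Article"]:
    print(article["title"], article["url"])
```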

Example of the Weaviate GraphQL interface
Example of a Jupyter Notebook

Implementation

Our implementation is a microservices-based architecture in which each microservice is connected to Apache Kafka. Kafka acts as the middleware glue that connects all microservices, so they can easily exchange information with each other.

To facilitate deployment, all services run in Docker: for testing purposes, we run on one or two older Dell desktops with 32 GB of RAM but without a GPU. All dockerized microservices and other services are connected through Apache Kafka using a single broker.

Only the communication with Weaviate, the vector search engine database, goes through REST.

Weaviate is configured to vectorize text (i.e. create semantic word embeddings of the whole article and each paragraph) and images (so we can recognize similar images), and it includes a Question & Answering service. The latter can, for example, be used to ask a question such as “Who is the current president of the US?”. More interestingly, it could also be used to verify the credibility of a news channel: define control questions to which you know the answer, and ask them against the channel's content. If many answers are wrong, you can mark the news channel as untrustworthy. And, of course, this could be automated too.
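
A hedged sketch of such a control question is shown below, assuming Weaviate's Question & Answering module is enabled and the schema contains an Article class with a content property (both assumptions for illustration).

```python
import weaviate

# Ask a control question against the stored articles; requires Weaviate's
# Question & Answering module. Class and property names are assumptions.
client = weaviate.Client("http://localhost:8080")

result = (
    client.query
    .get("Article", ["title", "_additional { answer { result certainty } }"])
    .with_ask({
        "question": "Who is the current president of the US?",
        "properties": ["content"],
    })
    .with_limit(1)
    .do()
)

answer = result["data"]["Get"]["Article"][0]["_additional"]["answer"]
print(answer["result"], answer["certainty"])
```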

In the GUI, all saved articles are stored inside a knowledge graph, so we can do graph queries too.

Federated analysis & learning

Although our research environment runs standalone, it could also run in a federated context, connecting different organisations. The articles scraped and enriched by organisation A can easily be shared with another Kafka cluster run by organisation B, so that B does not need to scrape the same websites. In addition, analyst feedback, e.g. on the credibility of a news channel or article, can also be shared through Kafka, improving the analysis capability of all participating organisations.
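
A minimal sketch of such cross-organisation sharing is shown below: enriched articles are consumed from organisation A's broker and re-published to organisation B's broker. The broker addresses and the topic name are placeholders; in practice a dedicated replication tool such as Kafka MirrorMaker could be used instead.

```python
import json
from kafka import KafkaConsumer, KafkaProducer

# Forward enriched articles from organisation A's Kafka cluster to organisation B's.
# Broker addresses and the topic name are illustrative placeholders.
consumer = KafkaConsumer(
    "enriched-articles",
    bootstrap_servers="kafka-org-a:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
    group_id="federation-bridge",
)
producer = KafkaProducer(
    bootstrap_servers="kafka-org-b:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for message in consumer:
    producer.send("enriched-articles", message.value)
```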


Besides federated analysis, the fact that the analyst's feedback is stored back in the database supports learning from examples. AI models can be trained to suggest other disinformation messages, similar to how ASReview helps researchers quickly discover relevant articles from a list of articles.
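
As a simple illustration of this idea, the sketch below trains a classifier on articles the analyst has already labelled and ranks the remaining articles by their predicted probability of being disinformation. The data, features, and model choice are placeholder assumptions; in the full solution the labels would come from the analyst feedback stored in Weaviate.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Placeholder training data: in practice, texts and labels would be loaded
# from the database of analyst-reviewed articles.
labelled_texts = [
    "article text labelled as disinformation",
    "article text labelled as credible",
]
labels = [1, 0]  # 1 = disinformation, 0 = credible
unlabelled_texts = ["new, not yet reviewed article text"]

vectorizer = TfidfVectorizer()
X_labelled = vectorizer.fit_transform(labelled_texts)
X_unlabelled = vectorizer.transform(unlabelled_texts)

model = LogisticRegression().fit(X_labelled, labels)
scores = model.predict_proba(X_unlabelled)[:, 1]  # probability of disinformation

# Present the highest-scoring articles to the analyst first (ASReview-style).
ranking = sorted(zip(scores, unlabelled_texts), reverse=True)
print(ranking)
```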

Open source services and NLP models that are used

Screenshots from the Analyst Dashboard

The analyst can search through the enriched articles, either via GraphQL or a Jupyter Notebook, but also through our own GUI.

Search through relevant articles

Examine a disinformation narrative

Explore the results in a cluster diagram

Or explore the results on the map
