Giter Site home page Giter Site logo

fagan2888 / extracted Goto Github PK

View Code? Open in Web Editor NEW

This project forked from onsbigdata/extracted

0.0 0.0 0.0 451 KB

Scripts to extract and parse TED (Tenders Electronic Daily: http://ted.europa.eu/TED/main/HomePage.do) documents.

License: MIT License

Python 100.00%

extracted's Introduction

ExtracTED

Overview

TED (Tenders Electronic Daily) is the online version of the 'Supplement to the Official Journal' of the EU, dedicated to European public procurement.

TED provides free access to public procurement notices from contracting authorities based in the European Union and in the European Economic Area (also if they provide services in any other country).

TED website currently offers bulk downloads of XML packages dating back to 1993, which can be found on the FTP ftp://ted.europa.eu/ accessible with generic credentials (guest/guest).

This repo contains scrips to extract and parse TED Contract award notices, i.e. the results of the procurement procedure. The scripts have been fully tested to extract all Contract award notices for 2014, 2015 and 2016. Documents are extracted as Python dictionaries which can be saved in a MongoDB database for easy retrieval, or further normalised to be converted in CSVs.

Example of TED Contract award notice

The image below show an example of document as shown on the TED website.

Image

Example of extracted dictionary

An example of extracted data (from same document above):

{
  'DOC_ID': '466898-2016',
  'CODED_DATA': {
    'NOTICE_DATA': {
      'IA_URL_GENERAL': 'www.ekz.ch',
      'ISO_COUNTRY': 'CH',
      'NO_DOC_OJS': '2016/S 253-466898',
      'ORIGINAL_CPV': ['79970000'],
      'REF_NOTICE': ['2016/S 172-310484']
      },
    'CODIF_DATA': {
      'AA_AUTHORITY_TYPE': '4',
      'AC_AWARD_CRIT': 'Z',
      'DS_DATE_DISPATCH': '20161230',
      'MA_MAIN_ACTIVITIES': ['Z'],
      'NC_CONTRACT_NATURE': '4',
      'PR_PROC': '2',
      'RP_REGULATION': '7',
      'TD_DOCUMENT_TYPE': '7',
      'TY_TYPE_BID': '9'
      }
    },
 'CONTRACT': {
   'OTH_NOT': 'NO',
   'CONTRACTING_AUTHORITY': 'Elektrizitätswerke des Kantons Zürich',
   'CONTRACT_OBJECT': {
     'CONCLUSION_FRAMEWORK_AGREEMENT': 'NO',
     'CONTRACTS_DPS': 'NO',
     'CONTRACT_COVERED_GPA': 'YES',
     'CPV_MAIN': '79970000'
     }
   },
   'AWARDS_OF_CONTRACT': [
    {
      'CONTRACTOR': {
         'ADDRESS': 'Mühlebachstraße 52',
         'COUNTRY': 'CH',
         'OFFICIALNAME': 'Linkgroup AG',
         'POSTAL_CODE': '8008',
         'TOWN': 'Zürich'
         },
      'CONTRACT_VALUE': {
        'COST': {
          'CURRENCY': 'CHF',
          'VALUE': 700000.0
          }
        }
    }
   ]
}

Fields

Fields and description of the extracted data

1. DOC_ID section

Field Data type Description
DOC_ID String Unique document number in TED.

2. CODED_DATA section

The CODED_DATA section is divided in 2 groups of data.

  • NOTICE_DATA contains data related to the notice
  • CODIF_DATA contains the META data (codification) related to the notice.

2.1 NOTICE_DATA section

Field Data type Description
NO_DOC_OJS String Notice number in TED
ORIGINAL_NUTS List(String) Region code(s) of the place of performance or delivery. A 2-5 digits code of the Nomenclature of Territorial Units for Statistics. Lookup values for first two digits: ISO_COUNTRY.csv
ORIGINAL_CPV List(String) Product or service 8 digits code(s) of the Common Procurement Vocabulary. Lookup values for first two digits: CPV.csv
ISO_COUNTRY String 2-characters ISO code of the country where the contracting body is located. Lookup values: ISO_COUNTRY.csv
IA_URL_GENERAL String Main internet address (URL) of the contracting body
REF_NOTICE List(String) Reference notice number in TED. Referencing a previous publication (prior information, corrigendum, ...)
VALUES_LIST Total Value Estimated total value(s) or total final value of the procurement
2.1.1 VALUES_LIST sub-section

VALUES_LIST can be composed of the following fields:

Field Data type Description
GLOBAL_VALUE Value The total value of the tender. Should equal the sum of in CONTRACTS_VALUE List
CONTRACTS_VALUE List(Value) Individual awards value. This happens when the contract is composed of multiple parts or lots.
Value Data Type

Each value can be single or range value and can have VAT percentage

  • When is single:
Field Data type Description
CURRENCY String Currency of the value
VALUE Float Value
VAT_PRCT Int Vat percentage
  • When is range:
Field Data type Description
CURRENCY String Currency of the value
LOW_VALUE Float Lower value of the range
HIGH_VALUE Float Higher value of the range
VAT_PRCT Int Vat percentage

2.2 CODIF_DATA section

Field Data type Description
DS_DATE_DISPATCH String Date of dispatch of the notice. Format: yyyymmdd
TD_DOCUMENT_TYPE String Type of document. Lookup values: TD_DOCUMENT_TYPE.csv
AA_AUTHORITY_TYPE String Type of awarding authority. Lookup values: AA_AUTHORITY_TYPE.csv
NC_CONTRACT_NATURE String Nature of contract. Lookup values: NC_CONTRACT_NATURE.csv
PR_PROC String Type of procedure. Lookup values: PR_PROC.csv
RP_REGULATION String The regulation that applies to the procedure. Lookup values: RP_REGULATION.csv
TY_TYPE_BID String Type of bid. Lookup values: TY_TYPE_BID.csv
AC_AWARD_CRIT String Type of awarding criteria. Lookup values: AC_AWARD_CRIT.csv
MA_MAIN_ACTIVITIES List(String) Main activity of the contracting body. Lookup values: MA_MAIN_ACTIVITIES.csv

3. CONTRACT section

The CONTRACT section contains the notice itself, in the XML format. In the original XML file this section can be available in different translations, up to 24 when is fully translated. If available, the English translation of the contract is preferred. Otherwise, French and German becomes second and third choice respectively. When any of this translation is not present, the first available is picked.

3.1 OTH_NOT section

Field Data type Description
OTH_NOT String Indicate whether the notice follow a standard structure (then OTH_NOT = NO) or structure is open to allow the publication of any other notice which is not following a standard form, aka non-structured notice (then OTH_NOT = YES). Only when OTH_NOT = NO, the fields CONTRACTING_AUTHORITY, CONTRACT_OBJECT and AWARDS_OF_CONTRACT are extracted

3.2 CONTRACTING_AUTHORITY section

Field Data type Description
CONTRACTING_AUTHORITY String Name of Contracting Authority

3.3 CONTRACT_OBJECT section

Object of the contract.

Field Data type Description
NUTS List(String) Region code(s) of the place of performance or delivery. A 2-5 digits code of the Nomenclature of Territorial Units for Statistics. Lookup values for first two digits: ISO_COUNTRY.csv
NUTS_EXTRA String Additional extracted comments on the place of performance or delivery
CPV_MAIN String Main Product or service 8 digits code of the Common Procurement Vocabulary. Lookup values for first two digits: CPV.csv
CONTRACT_COVERED_GPA String YES or NO if contract is covered by GPA (Government Procurement Agreement)
CONCLUSION_FRAMEWORK_AGREEMENT String YES or NO if contract is part of a Framework Agreement
CONTRACTS_DPS String YES or NO if contract is subject to a Dynamic Purchasing System
CONTRACT_VALUE Contract Value Total value. Should correspond to the sum of individual contractor's contract values. See Section AWARDS_OF_CONTRACT
3.3.1 CONTRACT_VALUE sub-section

Each contract value can be composed of the following fields:

Field Data type Description
COST Value The actual cost
ESTIMATE Value An initial estimate of the cost
NUMBER_OF_YEARS Int Number of year the contract last. Cost/Estimate might be given per year
NUMBER_OF_MONTHS Int Number of month the contract last. Cost/Estimate might be given per month
Value Data Type

Fields are same as mentioned above. See Section Value Data Type

3.4 AWARDS_OF_CONTRACT section

List of awards.

Each award can contain the following sub-sections:

  • CONTRACTOR: General Information about successful bidder
  • CONTRACT_VALUE: Contract value associated with contractor.
3.4.1 CONTRACTOR sub-section
Field Data type Description
OFFICIALNAME String Name of Contractor
COUNTRY String Country of Contractor
ADDRESS String Address (Street Name) of Contractor
TOWN String Town of Contractor
POSTAL_CODE String Postal Code of Contractor
3.4.2 CONTRACT_VALUE sub-section

Fields are same as mentioned above. See Section 3.3.1 CONTRACT_VALUE

How do I use it?

Requirements/Pre-requisites

Python modules required:

  • lxml
  • voluptuous
  • pandas

Optional:

  • pymongo

Python version: 3.6 Database: MongoDB, local mode

How to run the project locally

Example of how to convert a document

file_path = '454322_2015.xml'  # Contract Award notice downloaded from tED website

# Extract the raw data
from extractor import extract
raw = extract(file_path)

# Validate the raw data
from validator import validate
data = validate(raw)

# Prune the data: remove empty fields
from validator import prune
prune(data)

Project Structure

Repository structure:

.
├── Lookups
│   └── AA_AUTHORITY_TYPE.csv
│   └── AC_AWARD_CRIT.csv
│   └── CPV.csv
│   └── ISO_COUNTRY.csv
│   └── MA_MAIN_ACTIVITY.csv
│   └── NC_CONTRACT_NATURE.csv
│   └── PR_PROC.csv
│   └── RP_REGULATION.csv
│   └── TD_DOCUMENT_TYPE.csv
│   └── TY_TYPE_BID.csv
├── extractor.py
└── validator.py
└── mongo_import.py
  • Lookups: folder containing various lookup files
  • extractor.py: script to extract raw data from the contract award notices
  • validator.py: scripts to validate the raw data and prune the dictionary (i.e. remove the empty fileds)
  • mongo_import.py: script to upload the data in a MongoDB database

Contributors

Alessandra Sozzi, working for the Office for National Statistics Big Data project

LICENSE

Released under the MIT License.

extracted's People

Contributors

alessandrasozzi avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.