garetjax / irco

International Research Collaboration graphs utility

License: MIT License

Ruby 1.39% Python 73.18% Shell 5.44% CoffeeScript 0.13% CSS 19.58% JavaScript 0.28%

irco's People

Forkers

pombredanne

irco's Issues

Process records with invalid affiliation mappings

Some records in the WoS dumps contain ambiguous affiliation data. Currently these records are parsed incorrectly and, as a consequence, invalid data is added to the database.

Correct format

The correct format of the C1 field uses square brackets [] to define to which authors a given affiliation belongs. Example (with added newlines):

[El Miedany, Y.] North Kent Hosp, Dartford, England;
[El Gaafary, M.; El Yassaki, A.; Youssef, S.] Ain Shams Univ, Cairo, Egypt;
[Ahmed, I.] Cairo Univ, Cairo, Egypt;
[Hegazi, M. O.] Al Adan Hosp, Kuwait, Kuwait;
[Palmer, D.] North Middlesex Univ Hosp, London, England

When records are parsed using the correct format, authors parsed from the 'AF' record can be matched to the correct institution by using the information provided in the square brackets.

Wrong format

Some records in the dumped file have a different structure for the C1 field: the square brackets are missing. Example (with added newlines):

Adan Hosp, Minist Hlth, Dept Med, Ahmadi City, Kuwait;
Adan Hosp, Minist Hlth, Intens Care Unit, Ahmadi City, Kuwait;
Adan Hosp, Minist Hlth, Dept Radiol, Ahmadi City, Kuwait;
Adan Hosp, Minist Hlth, Dept Med, Ahmadi City, Kuwait;
Adan Hosp, Minist Hlth, ICU, Ahmadi City, Kuwait

Records using this format don't contain enough information to match authors to their affiliations precisely. The example above was taken from a publication with the following content in the AF field:

Bitar, ZI;
Ashebu, SD;
Ahmed, S

As this example shows, positional matching of affiliations to authors is not possible as there are records with more authors than institutions, or more institutions than authors.

Given the issue presented above, I think that the cleanest solution would be to completely ignore these records, even if that means ignoring 25%* of the records.

* Actually, it is less than that: 25% is the share of records with no Kuwaiti affiliation, including the ones that WoS erroneously includes because an affiliation contains "Kuwait" in its name.
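The distinction between the two C1 formats can be sketched as follows. This is only an illustration, not irco's actual parser: the function name `parse_c1` and the choice to return `None` for the bracket-less format are assumptions made for the example.

```python
import re

# One "[Author; Author] Institution, City, Country" entry of a C1 field.
_BRACKETED = re.compile(r"\[(?P<authors>[^\]]+)\]\s*(?P<affiliation>[^;\[]+)")


def parse_c1(c1):
    """Return {author: [affiliation, ...]}, or None if C1 has no brackets.

    Hypothetical helper: a None result marks the ambiguous format, so the
    caller can decide to skip the record instead of guessing a mapping.
    """
    if "[" not in c1:
        return None
    mapping = {}
    for match in _BRACKETED.finditer(c1):
        affiliation = match.group("affiliation").strip()
        # Authors inside one pair of brackets are separated by semicolons.
        for author in match.group("authors").split(";"):
            mapping.setdefault(author.strip(), []).append(affiliation)
    return mapping
```

Author names in C1 (for example `El Miedany, Y.`) may be abbreviated differently from those in AF, so matching the two lists would still need a separate name-normalization step that this sketch does not cover.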

adding first author affiliation field

This issue builds on the request in issue 14.

In many cases, the papers are missing reprint author information. In those cases, it would be useful to use the first author's affiliation instead. Can we add another field to the publications table that stores the first author affiliation? This would allow us to run queries that filter on the corresponding author's country and, when that is missing, fall back to the first author's affiliation country.
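The requested fallback could look roughly like this. The dictionary keys `reprint_country` and `first_author_country` are hypothetical names chosen for the example; they are not taken from irco's schema.

```python
def publications_for_country(publications, country):
    """Select publications by corresponding-author country.

    Falls back to the first author's country when the reprint author
    country is missing (None or empty). Field names are assumptions.
    """
    selected = []
    for pub in publications:
        effective = pub.get("reprint_country") or pub.get("first_author_country")
        if effective == country:
            selected.append(pub)
    return selected
```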

Add headers to the CSV raw input

Automate data download from WoS

checking author affiliation and country name capitalization

In some cases the raw data has the country name in capital letters. Could this be interfering with associating author affiliations with country names? I am sending a detailed file and some raw data for you to check.

Provide a generic data processing framework

Processors can then be registered and applied to the database using a dedicated command or through the web interface.
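One possible shape for such a framework is a small registry keyed by processor name. Everything in this sketch (`PROCESSORS`, `processor`, `apply_processors`, and the toy `uppercase-country` processor) is hypothetical and only illustrates the register-then-apply idea.

```python
PROCESSORS = {}  # processor name -> callable taking and returning a record


def processor(name):
    """Decorator that registers a record processor under a unique name."""
    def register(func):
        if name in PROCESSORS:
            raise ValueError("processor already registered: " + name)
        PROCESSORS[name] = func
        return func
    return register


@processor("uppercase-country")
def uppercase_country(record):
    # Toy processor: return a copy with the country name upper-cased.
    updated = dict(record)
    updated["country"] = updated["country"].upper()
    return updated


def apply_processors(records, names):
    """Apply the named processors, in order, to every record."""
    for name in names:
        func = PROCESSORS[name]
        records = [func(record) for record in records]
    return records
```

A command-line entry point or a web view could then call `apply_processors` with the names chosen by the user.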

documentation request for irco-graph

It would be very helpful if you could document how each of the country, institution and author graphs is generated. What are the definitions of a node, an edge, the node weight and the edge weight?

Save data modification history into the database

This would make it possible to rebuild a consistent data set and to trace the actions that led to a given state.

Ideally, each entry would carry the IRCO version, timestamp, exact command, ...
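A history entry with the three fields named above might be modelled as follows. The class and function names are assumptions for illustration, and how the entry would be persisted in the database is left open.

```python
import shlex
import sys
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class HistoryEntry:
    irco_version: str
    timestamp: datetime  # timezone-aware, UTC
    command: str         # the command line as it was invoked


def make_history_entry(irco_version, argv=None):
    """Build an entry for the current invocation (hypothetical helper)."""
    argv = sys.argv if argv is None else argv
    return HistoryEntry(
        irco_version=irco_version,
        timestamp=datetime.now(timezone.utc),
        command=shlex.join(argv),  # quotes arguments that need it
    )
```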

Extract country information from the institution name

Add irco version to the source metadata

improving author affiliation data

I found the following issue when analyzing detailed data for Qatar. For publication year 1985 there are a total of 15 records in the raw data file (savedrecs-8.txt); see rows 486 to 500 when you open this .txt file in Excel. Of these 15 records, 11 are imported into the database when I run the irco-import command. I checked the author affiliation information in the RP field and found that, of the four records that are not included (rows 489, 491, 497 and 498), two have clear reprint author information. Rows 491 and 497 show the reprint author information clearly; however, the affiliation of the other authors of these papers is unclear (as seen in field C1). In such cases, where reprint author information is available but the affiliations of the other authors are unclear, let's include the records as well, and put in 'NA' or some other placeholder for the missing affiliations of the other authors. This would allow a better count of papers by corresponding author country, which is an important part of the analysis.

Split author affiliations into multiple affiliations when more than one are present.

An author can have multiple affiliations for a single publication as well. For example scopus/2-s2.0-84858999554 in testdata/scopus.csv

Import files with extensions other than txt when recursing into folders

feature request for directed graphs

It would be very useful if we could generate directed graphs of countries, institutions, and authors. The corresponding author (or reprint author or first author) would be the source, and the target nodes would be the co-authors. In a country graph, the source would be the country of the corresponding author, and the target node would be the country of a co-author. The weight of a directed edge in a country graph would be the number of papers between the two countries. For example, if there are 50 papers in which the corresponding author is from country A and has a co-author in country B, and 20 papers with a corresponding author in country B and a co-author in country A, the directed edge from A to B will have weight 50 and the directed edge from B to A will have weight 20. This type of graph would allow us to compute some reciprocity metrics (at country level and author level).
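The edge-weight definition in this request can be sketched with a counter keyed by (source, target) pairs. The input shape, a sequence of `(corresponding_country, coauthor_countries)` tuples, is an assumption made for the example and does not reflect how irco stores publications.

```python
from collections import Counter


def directed_country_edges(papers):
    """Count papers per directed (source country, target country) edge.

    `papers` is an iterable of (corresponding_country, coauthor_countries).
    A paper contributes at most 1 to each edge, however many co-authors
    it has in the target country; same-country pairs are skipped.
    """
    weights = Counter()
    for source, coauthor_countries in papers:
        for target in set(coauthor_countries):
            if target != source:
                weights[(source, target)] += 1
    return weights


# The example from the request: 50 papers led from A, 20 led from B.
papers = [("A", ["B"])] * 50 + [("B", ["A"])] * 20
edges = directed_country_edges(papers)
print(edges[("A", "B")], edges[("B", "A")])  # 50 20
```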

feature request for data import report

The new irco-import function is working fine now. However, can a feature be added so that it generates a csv file in which it places all records that did not get included in the database due to ambiguous affiliation information? And can it also add in that csv file, a report of how many records in each publication year were ignored due to ambiguous information? This would allow for more precise information about the data that was omitted from analysis due to insufficient affiliation information.
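A report along these lines could be produced with the standard `csv` module. This sketch writes the ignored records and the per-year counts to two separate outputs rather than one file, and the record keys (`year`, `title`, `c1`) are hypothetical.

```python
import csv
from collections import Counter


def write_import_report(ignored, records_out, summary_out):
    """Write ignored records and a per-year count as CSV.

    `ignored` is a list of dicts; `records_out` and `summary_out` are
    writable text files (open real files with newline="").
    """
    writer = csv.DictWriter(
        records_out, fieldnames=["year", "title", "c1"], extrasaction="ignore"
    )
    writer.writeheader()
    writer.writerows(ignored)

    per_year = Counter(record["year"] for record in ignored)
    summary = csv.writer(summary_out)
    summary.writerow(["year", "ignored_records"])
    for year in sorted(per_year):
        summary.writerow([year, per_year[year]])
    return per_year
```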

country affiliation for UAE data

I have noted that in my records the country affiliation for publications from UAE is not being recorded. I get messages in the terminal window when running the irco-graph command and irco-import command. In many cases the name is given in the record as U Arab Emirates. I am attaching a screen shot here:
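One way to handle spellings such as "U Arab Emirates" is an alias table consulted before the country lookup. The table contents and the canonical name used here are assumptions for illustration; the names irco actually expects may differ.

```python
# Hypothetical alias table: lower-cased spelling -> canonical country name.
COUNTRY_ALIASES = {
    "u arab emirates": "United Arab Emirates",
    "uae": "United Arab Emirates",
}


def normalize_country(raw):
    """Map known alternative spellings to a canonical country name.

    Whitespace is collapsed and the alias lookup ignores case; names
    without an alias are returned with only the whitespace cleaned up.
    """
    cleaned = " ".join(raw.split())
    return COUNTRY_ALIASES.get(cleaned.casefold(), cleaned)
```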

Provide functionality to merge authors or institutions

irco import not working with new release irco 0.9

I upgraded my irco version to 0.9 and I get an error when I run the irco-import command. I am sending you the error log. I was importing the same datafiles as I have used before with irco, so I don't think it is an issue with the data files.