garetjax / irco
International Research Collaboration graphs utility
License: MIT License
Some records in the WoS dumps contain ambiguous affiliation data. Currently these records are parsed incorrectly and, as a consequence, invalid data is added to the database.
The correct format of the C1 field uses square brackets [] to define to which authors a given affiliation belongs. Example (with added newlines):
[El Miedany, Y.] North Kent Hosp, Dartford, England;
[El Gaafary, M.; El Yassaki, A.; Youssef, S.] Ain Shams Univ, Cairo, Egypt;
[Ahmed, I.] Cairo Univ, Cairo, Egypt;
[Hegazi, M. O.] Al Adan Hosp, Kuwait, Kuwait;
[Palmer, D.] North Middlesex Univ Hosp, London, England
When records are parsed using the correct format, authors parsed from the 'AF' record can be matched to the correct institution by using the information provided in the square brackets.
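The bracketed format described above can be split into (authors, institution) pairs with a small regular expression. This is only a minimal sketch: `parse_c1` is a hypothetical helper, not a function from irco, and it assumes the C1 value has already been joined into a single string.

```python
import re

# Hypothetical helper (not part of irco): one match per
# "[author; author] institution" group in a bracketed C1 value.
_C1_ENTRY = re.compile(r"\[([^\]]+)\]\s*([^\[]+)")


def parse_c1(value):
    """Return a list of (author_names, institution) pairs."""
    pairs = []
    for authors, institution in _C1_ENTRY.findall(value):
        # Authors inside the brackets are separated by semicolons.
        names = [name.strip() for name in authors.split(";") if name.strip()]
        # Drop the trailing separator left between two groups.
        pairs.append((names, institution.strip().rstrip(";").strip()))
    return pairs


c1 = (
    "[El Miedany, Y.] North Kent Hosp, Dartford, England; "
    "[El Gaafary, M.; El Yassaki, A.; Youssef, S.] Ain Shams Univ, Cairo, Egypt; "
    "[Ahmed, I.] Cairo Univ, Cairo, Egypt"
)
affiliations = parse_c1(c1)
```

Each pair keeps the author names exactly as they appear in C1; matching them against the names from the AF field is a separate step that this sketch does not cover.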
Some records in the dumped file have a different structure for the C1 field: the square brackets are missing. Example (with added newlines):
Adan Hosp, Minist Hlth, Dept Med, Ahmadi City, Kuwait;
Adan Hosp, Minist Hlth, Intens Care Unit, Ahmadi City, Kuwait;
Adan Hosp, Minist Hlth, Dept Radiol, Ahmadi City, Kuwait;
Adan Hosp, Minist Hlth, Dept Med, Ahmadi City, Kuwait;
Adan Hosp, Minist Hlth, ICU, Ahmadi City, Kuwait
Records using this format don't contain enough information to match authors to their affiliations precisely. The example presented above was taken from a publication with the following content for the AF field:
Bitar, ZI;
Ashebu, SD;
Ahmed, S
As this example shows, positional matching of affiliations to authors is not possible as there are records with more authors than institutions, or more institutions than authors.
Given the issue presented above, I think that the cleanest solution would be to completely ignore these records, even if that means ignoring 25%* of the records.
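Ignoring these records could be sketched as a filter that runs before import. The helper names are hypothetical, the records are assumed to be dictionaries keyed by WoS field tag, and the check is deliberately simple: a non-empty C1 value that does not start with a square bracket is treated as ambiguous.

```python
def is_ambiguous_c1(value):
    """Return True for a non-empty C1 value without bracketed author groups."""
    text = value.strip()
    return bool(text) and not text.startswith("[")


def split_records(records):
    """Separate records to import from records with ambiguous affiliations."""
    kept, ignored = [], []
    for record in records:
        if is_ambiguous_c1(record.get("C1", "")):
            ignored.append(record)
        else:
            kept.append(record)
    return kept, ignored


records = [
    {"AF": "Ahmed, I.", "C1": "[Ahmed, I.] Cairo Univ, Cairo, Egypt"},
    {"AF": "Bitar, ZI; Ashebu, SD; Ahmed, S",
     "C1": "Adan Hosp, Minist Hlth, Dept Med, Ahmadi City, Kuwait"},
]
kept, ignored = split_records(records)
```

Records with an empty C1 field are kept by this sketch; whether they should be imported is a separate decision.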
This issue builds on the request in issue 14.
In many cases, the papers are missing reprint author information. In those cases, it would be useful if we could use the first author's affiliation instead. Can we add another field to the publications table that stores the first author's affiliation? This would allow us to run queries in which we specify the corresponding author's country and, if that is missing, fall back to the first author's affiliation country to search for publications.
In some cases the raw data has the country name in capital letters. Could this be interfering with the association of author affiliations to country names? I am sending a detailed file and some raw data for you to check.
Processors can then be registered and applied to the database using a dedicated command or through the web interface.
It would be very helpful if you could document how each of the country, institution and author graphs is generated. What are the definitions of a node, an edge, a node weight and an edge weight?
In order to be able to rebuild a consistent data set and to trace the actions that led to a given state.
Ideally, each entry would carry the IRCO version, timestamp, exact command,...
I found the following issue when analyzing detailed data for Qatar. For publication year 1985 there are a total of 15 records in the raw data file (savedrecs-8.txt); see rows 486 to 500 when you open this .txt file in Excel. Of these 15 records, 11 are imported into the database when I run the irco-import command. I checked the author affiliation information in the RP field and found that, of the four records that are not included (rows 489, 491, 497 and 498), two have clear reprint author information: rows 491 and 497. However, the affiliation of the other authors in those papers is unclear (as seen in field C1). In cases like these, where reprint author information is available but the affiliations of the other authors are unclear, let's include the records as well and put 'NA' or some other placeholder in for the missing affiliations. This would allow a better count of papers by corresponding author country, which is an important part of the analysis.
An author can also have multiple affiliations for a single publication. For example, see scopus/2-s2.0-84858999554 in testdata/scopus.csv.
It would be very useful if we could generate directed graphs of countries, institutions, and authors. The corresponding author (or reprint author, or first author) would be the source, and the target nodes would be the co-authors. In a country graph, the source would be the country of the corresponding author, and the target node would be the country of a co-author. The weight of a directed edge in a country graph would be the number of papers between the two countries. For example, if there are 50 papers in which the corresponding author is from country A and has a co-author in country B, and 20 papers in which the corresponding author is from country B and has a co-author in country A, the directed edge from A to B will have weight 50 and the directed edge from B to A will have weight 20. This type of graph would allow us to compute reciprocity metrics (at country level and author level).
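The edge weights in the example above can be computed with a plain counter. This is a sketch of the requested behaviour rather than existing irco code: `directed_country_edges` is a hypothetical name, and each paper is assumed to be reduced to the corresponding author's country plus the list of co-author countries.

```python
from collections import Counter


def directed_country_edges(papers):
    """Count papers per (corresponding country, co-author country) edge."""
    weights = Counter()
    for corresponding, coauthor_countries in papers:
        # A paper counts once per target country, however many
        # co-authors it has there; same-country pairs are skipped.
        for target in set(coauthor_countries):
            if target != corresponding:
                weights[(corresponding, target)] += 1
    return weights


# 50 papers led from A with a co-author in B, 20 led from B with one in A.
papers = [("A", ["B"])] * 50 + [("B", ["A"])] * 20
edges = directed_country_edges(papers)
```

Here `edges[("A", "B")]` is 50 and `edges[("B", "A")]` is 20, matching the description. Counting each target country once per paper is an assumption; counting once per co-author would give different weights.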
The new irco-import function is working fine now. However, can a feature be added so that it generates a csv file in which it places all records that did not get included in the database due to ambiguous affiliation information? And can it also add in that csv file, a report of how many records in each publication year were ignored due to ambiguous information? This would allow for more precise information about the data that was omitted from analysis due to insufficient affiliation information.
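A report like the one requested could look roughly as follows. This is a sketch under assumptions, not a description of irco-import: `write_ignored_report` is a hypothetical function, and the ignored records are assumed to be dictionaries keyed by WoS field tag (PY for publication year, TI for title, C1 for affiliations).

```python
import csv
import io
from collections import Counter


def write_ignored_report(ignored, out):
    """Write ignored records, then a per-year summary, to a CSV stream."""
    writer = csv.writer(out)
    writer.writerow(["year", "title", "C1"])
    for record in ignored:
        writer.writerow([
            record.get("PY", ""),
            record.get("TI", ""),
            record.get("C1", ""),
        ])
    # Summary section: number of ignored records per publication year.
    counts = Counter(record.get("PY", "") for record in ignored)
    writer.writerow([])
    writer.writerow(["year", "ignored_records"])
    for year in sorted(counts):
        writer.writerow([year, counts[year]])
    return counts


ignored = [
    {"PY": "1985", "TI": "Paper one", "C1": "Adan Hosp, Ahmadi City, Kuwait"},
    {"PY": "1985", "TI": "Paper two", "C1": "Adan Hosp, Ahmadi City, Kuwait"},
    {"PY": "1990", "TI": "Paper three", "C1": "Adan Hosp, Ahmadi City, Kuwait"},
]
buffer = io.StringIO()
counts = write_ignored_report(ignored, buffer)
```

Writing both sections to one file keeps the request's "single CSV" shape; two separate files would be easier to load into other tools.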
I have noticed that in my records the country affiliation for publications from the UAE is not being recorded. I get messages in the terminal window when running the irco-graph and irco-import commands. In many cases the name is given in the record as U Arab Emirates. I am attaching a screenshot here:
I upgraded my irco version to 0.9 and I get an error when I run the irco-import command. I am sending you the error log. I was importing the same datafiles as I have used before with irco, so I don't think it is an issue with the data files.