IP Clearance about nuttx HOT 68 CLOSED

justinmclean avatar justinmclean commented on August 19, 2024
IP Clearance

from nuttx.

Comments (68)

protobits avatar protobits commented on August 19, 2024

Could we start with the easy cases? I feel that reducing the size of the problem also makes it less intimidating to approach.
We are already manually changing headers from BSD to Apache for files whose authors are committers with ICLAs, so I think making an automated pass for this case should not be that hard: parse the header for authors, see if all are committers, replace with the Apache header. If that sounds right I can script that and give it a try.
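A minimal sketch of the classification half of that pass, assuming headers credit people on single `Author:` comment lines and that a set of committers with ICLAs is supplied by the caller. The function names and the header layout are assumptions for illustration; headers that list several authors under one `Authors:` label would need extra handling, and the header replacement itself is left out.

```python
import re

# Assumed header layout: a C comment line such as
#   " *   Author: Some Name <someone@example.com>"
AUTHOR_RE = re.compile(
    r"^[ \t]*\*[ \t]*Author:[ \t]*(?P<name>[^<\n]+?)[ \t]*(?:<[^>]*>)?[ \t]*$",
    re.MULTILINE,
)


def header_authors(source):
    """Return the names found on 'Author:' lines of a file's header comment."""
    return [m.group("name") for m in AUTHOR_RE.finditer(source)]


def is_easy_case(source, committers_with_icla):
    """True only if at least one author is listed and all of them have ICLAs."""
    authors = header_authors(source)
    return bool(authors) and all(a in committers_with_icla for a in authors)
```

Files with no recognizable author line are deliberately reported as not easy, so they fall through to manual review.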

What confuses me, though, is that we're worrying about git authors. I believed that if someone contributes a file without listing themselves as an author in the header (for the BSD case), they concede their rights over the code by doing so. At least that was my understanding when I submitted patches to existing files and did not add a line listing myself as an author in every affected file. If this is not a correct assumption, I agree that a "best effort" approach (comparing git authors to the authors in the header) is the only remaining possibility.

from nuttx.

btashton avatar btashton commented on August 19, 2024

@justinmclean Can you help me understand our requirements here a little more with a couple of examples:

1. https://github.com/apache/incubator-nuttx/blob/master/arch/risc-v/include/arch.h
It would seem that this needs to keep the BSD header until Ken relicenses it under Apache, and we need to call this file out in the LICENSE file as BSD-3. It would not need to be called out in the NOTICE file.

2. https://github.com/apache/incubator-nuttx/blob/master/arch/arm/src/arm/arm.h
This one we can put the Apache header on, and we do not need to make any additions to the NOTICE or LICENSE files beyond the Apache boilerplate. This is because Greg has agreed to relicense this code.

3. https://github.com/apache/incubator-nuttx/blob/master/arch/arm/src/imxrt/hardware/rt102x/imxrt102x_ccm.h
This one we can put the Apache header on, and we do not need to make any additions to the NOTICE or LICENSE files beyond the Apache boilerplate. This is because Greg has agreed to relicense this code, and while there are other authors listed, he is the sole copyright holder listed.

4. https://github.com/apache/incubator-nuttx/blob/master/arch/arm/src/imxrt/imxrt_lcd.c
It would seem that this needs to keep the BSD header unless NXP is willing to relicense it under Apache, even though portions are copyrighted by Greg, and we need to call this file out in the LICENSE file as BSD-3. It would not need to be called out in the NOTICE file.

General Questions:
When do we need to add "Based on source code originally developed by" to the NOTICE file? In a couple of the files coming from FreeBSD I see entries like:

   * Portions of this software were developed by David Chisnall
   * under sponsorship from the FreeBSD Foundation.

I know we have other files with other license or cases to go through, but this should cover the vast majority and can get us moving in the right direction.

from nuttx.

btashton avatar btashton commented on August 19, 2024

@justinmclean Any thoughts on these examples? I'm trying to be 100% sure I understand what we need to do here to move this forward in a meaningful way.

from nuttx.

justinmclean avatar justinmclean commented on August 19, 2024
  1. Correct
  2. Correct
  3. It would depend on the history of the file and the changes made. In general, unless the changes are significant, the original license and header should be kept.
  4. Would need to be discussed. In general, 3rd party headers should not be changed without permission. It looks like they have been changed here, and it would be best to revert to the original header.

from nuttx.

justinmclean avatar justinmclean commented on August 19, 2024

Note that with a WIP disclaimer none of this actually blocks a release.

from nuttx.

adamfeuer avatar adamfeuer commented on August 19, 2024

License clearing wiki page (with draft process and tools): https://cwiki.apache.org/confluence/display/NUTTX/License+Clearing

This was used in release 9.0.0 and 9.1.0.

from nuttx.

xiaoxiang781216 avatar xiaoxiang781216 commented on August 19, 2024

Add the related email thread:
https://lists.apache.org/thread.html/r0d30d8c95e861826a3027499fc43bc3851e19f89fdaf8606eada1818%40%3Cdev.nuttx.apache.org%3E
https://lists.apache.org/thread.html/r3149c844791bd0164a3016cbebc690edd9277905678cfb33526937cb%40%3Cdev.nuttx.apache.org%3E
https://lists.apache.org/thread.html/r897f825f1bfcd3501c132438acc9403a70d415652119d1e528f7349f%40%3Cdev.nuttx.apache.org%3E

from nuttx.

xiaoxiang781216 avatar xiaoxiang781216 commented on August 19, 2024

@adamfeuer do you have enough free time to collect the statistics? My team leader has reserved a dedicated resource, @PeterBee97, to help you improve the tools and generate the report.

from nuttx.

adamfeuer avatar adamfeuer commented on August 19, 2024

Thanks @xiaoxiang781216 – I should have enough time to do a high-level analysis this week or next, and I could definitely use the help!

@PeterBee97 are you able to help me do this? If so, reply here or send me an email (it's on my profile), and we'll work out what to do. 🙂

from nuttx.

PeterBee97 avatar PeterBee97 commented on August 19, 2024

@adamfeuer Hi Adam, sure I'm here to help. BTW I spent some time yesterday on a script that doesn't modify anything yet but only tries to extract information. Hope this helps :)

from nuttx.

adamfeuer avatar adamfeuer commented on August 19, 2024

@PeterBee97 Great work with the script and database! I'll update my tools branch and post it here– would you be willing to do a PR to that, so we can have a single branch that we're working on? I'm hoping we can merge these tools to master so that others can help us or continue our work.

Here's a few questions:

  1. Are you subscribed to the [email protected] email list? If not, would you be willing to subscribe?
  2. What's your email address? Will you either post it here or send me an email at [email protected], so we can correspond with the NuttX email list if necessary?
  3. What time zone are you in? I am in Seattle WA USA, Pacific Time Zone, UTC-7.
  4. Have you seen the NuttX license clearing wiki page? The process we need to follow and improve is there, as well as a few tools.
  5. The authors in the file are good to have, but not enough to clear the licenses– we need to look at the git log and get authors from that. There's a script on the wiki page above that can do that.
  6. Would you be willing to make the script you wrote also emit a plain text file, ideally tab delimited CSV?

from nuttx.

adamfeuer avatar adamfeuer commented on August 19, 2024

@PeterBee97 I updated my license-clearing tools branch to upstream/master, here's where I've put my tools: https://github.com/starcat-io/incubator-nuttx/tree/feature/license-clearing-tools/tools/license-clearing

from nuttx.

adamfeuer avatar adamfeuer commented on August 19, 2024

@PeterBee97 Let's try running the process that we did on the sched/ module on either fs/ or mm/– only the estimation part, not the whole clearing process. They have 100-250 files each, so it's a smaller chunk. We need git authors as well as what is in the file headers. Once we have a way to get stats for that module and all its files, we can try to do it for the whole project.

You can see what we did on sched/ at this wiki subpage: https://cwiki.apache.org/confluence/display/NUTTX/Analysis+March+2020

from nuttx.

PeterBee97 avatar PeterBee97 commented on August 19, 2024

> @PeterBee97 Great work with the script and database! I'll update my tools branch and post it here– would you be willing to do a PR to that, so we can have a single branch that we're working on? I'm hoping we can merge these tools to master so that others can help us or continue our work.
>
> Here's a few questions:
>
>   1. Are you subscribed to the [email protected] email list? If not, would you be willing to subscribe?
>   2. What's your email address? Will you either post it here or send me an email at [email protected], so we can correspond with the NuttX email list if necessary?
>   3. What time zone are you in? I am in Seattle WA USA, Pacific Time Zone, UTC-7.
>   4. Have you seen the NuttX license clearing wiki page? The process we need to follow and improve is there, as well as a few tools.
>   5. The authors in the file are good to have, but not enough to clear the licenses– we need to look at the git log and get authors from that. There's a script on the wiki page above that can do that.
>   6. Would you be willing to make the script you wrote also emit a plain text file, ideally tab delimited CSV?

  1. Not yet, sure I'm willing to subscribe
  2. [email protected]
  3. I'm in Beijing, UTC+8 so my work time will be about 7 pm to 7 am in your timezone :(
  4. Yes, I browsed through the docs and mailing lists before making that tool
  5. Yeah, actually my tool is based on your script. The author0~author2 are from git log
  6. Sure, exporting to csv file is just one command in sqlite
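For reference, a minimal sketch of such an export done from Python rather than the sqlite shell, writing a tab-delimited file with a header row. The table name passed in is hypothetical, since the tool's actual schema is not shown in this thread.

```python
import csv
import sqlite3


def export_table_tsv(conn, table, out):
    """Write every row of `table` to `out` as tab-delimited text with a header row.

    `table` is interpolated into the SQL text, so it must come from a trusted
    source (identifiers cannot be bound as query parameters).
    """
    cur = conn.execute(f'SELECT * FROM "{table}"')
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow([col[0] for col in cur.description])  # column names
    writer.writerows(cur)
```

Called with an open `sqlite3` connection and a file opened with `newline=""`, this produces one line per database row.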

> @PeterBee97 Let's try running the process that we did on the sched/ module on either fs/ or mm/– only the estimation part, not the whole clearing process. They have 100-250 files each, so it's a smaller chunk. We need git authors as well as what is in the file headers. Once we have a way to get stats for that module and all its files, we can try to do it for the whole project.
>
> You can see what we did on sched/ at this wiki subpage: https://cwiki.apache.org/confluence/display/NUTTX/Analysis+March+2020

By typing sched/ in the DB Browser filter I can see that these files either have the Apache license already or have copyrights held only by Greg or by Xiaomi & Pinecone, who should have already approved the license change.

The CSV files are uploaded and a PR has been created: https://github.com/PeterBee97/incubator-nuttx/tree/feature/license-clearing-tools/tools/license-clearing

from nuttx.

adamfeuer avatar adamfeuer commented on August 19, 2024

@PeterBee97 Cool, thanks– I didn't realize the script already used git to find the authors, sorry for missing that. We will need all the authors, not just the top 3. I'll take a closer look tomorrow.

Re: Xiaomi and Pinecone already approving the license change, do you know if they have filed an Apache Software Grant Agreement (SGA)?

Would you be willing to run your tool on fs and mm directories, and see if you can extract a report of the authors for each section and file? That way we can see if we're dealing with 10 authors, 100 authors, etc.

I think another next step is to get you an account on the NuttX Fossology instance. At some point we'll need to get the data into there. I'll email Brennan and you on the list.

Thanks again for being willing to help with this!

from nuttx.

PeterBee97 avatar PeterBee97 commented on August 19, 2024

> @PeterBee97 Cool, thanks– I didn't realize the script already used git to find the authors, sorry for missing that. We will need all the authors, not just the top 3. I'll take a closer look tomorrow.
>
> Re: Xiaomi and Pinecone already approving the license change, do you know if they have filed an Apache Software Grant Agreement (SGA)?
>
> Would you be willing to run your tool on fs and mm directories, and see if you can extract a report of the authors for each section and file? That way we can see if we're dealing with 10 authors, 100 authors, etc.
>
> I think another next step is to get you an account on the NuttX Fossology instance. At some point we'll need to get the data into there. I'll email Brennan and you on the list.
>
> Thanks again for being willing to help with this!

Top 3 was my idea, given that some one-commit contributors can be ignored (can't they?). As for the license issue, I don't know the exact details; @xiaoxiang781216 knows better. I already ran the tool on the whole project, so those two directories can simply be filtered. I'll try to get a report for particular files.
You're welcome :)

from nuttx.

patacongo avatar patacongo commented on August 19, 2024

from nuttx.

adamfeuer avatar adamfeuer commented on August 19, 2024

@patacongo Are the original CVS and SVN archives saved anywhere?

from nuttx.

patacongo avatar patacongo commented on August 19, 2024

> @patacongo Are the original CVS and SVN archives saved anywhere?

No

from nuttx.

adamfeuer avatar adamfeuer commented on August 19, 2024

@patacongo Ok. I'll look through the commit messages to see what's going on there.

I'm logged in to Bitbucket, but for some reason I can't view the graph link you posted. Maybe it's a permissions issue or I don't have access to the graphs addon?

from nuttx.

xiaoxiang781216 avatar xiaoxiang781216 commented on August 19, 2024

> > @PeterBee97 Cool, thanks– I didn't realize the script already used git to find the authors, sorry for missing that. We will need all the authors, not just the top 3. I'll take a closer look tomorrow.
>
> I mentioned this before, but it bears repeating. The NuttX project was 13 years old in February of 2020. For the first 6 to 6 and a half years, the project used CVS and SVN. You will find no authorship or contact information for the first half of the project's life in the current GIT authors. The log will show me as the sole author during that time. I made by far most of the changes in those days, but not all.
>
> Prior to GIT, contributors were noted only in commit comments. It should be possible to get the names, or in most cases just user handles, from the comments, but with no contact information.
>
> Github apparently does not even know how to parse that early activity. If you look at https://github.com/apache/incubator-nuttx/graphs/contributors you would conclude that the project has only existed since sometime in 2013. The project was actually created in February of 2007. This is clearer in the Bitbucket statistics[1]: https://bitbucket.org/nuttx/nuttx/addon/bitbucket-graphs/graphs-repo-page#!graph=contributors&uuid=4430abf9-a782-49ff-bd16-bc1df696048e&type=c&group=weeks which goes all the way back to the day the project was created. I think that is because prior to GIT, authors were NOT referenced by email address, but rather with some UUID.
>
> [1] Note you have to be logged into Bitbucket to see the statistics there.

@PeterBee97 can we add a column to the database to indicate whether the source code existed before git was used? @patacongo, we need to gather the statistics first and convert the unambiguous part of the code base automatically (of course we need to review the PR carefully), and then work on the rest case by case. Otherwise NuttX can never become a top-level project.

from nuttx.

adamfeuer avatar adamfeuer commented on August 19, 2024

@xiaoxiang781216 @patacongo @PeterBee97 I cloned the Bitbucket repo last night (https://bitbucket.org/nuttx/nuttx/src/master/), looked through the commit logs, and I can see what @patacongo is talking about. I didn't compare to the github log, but we should probably also do that. Then we can see if we can do anything with the information there.

It seems like we should be able to come up with a strategy for dealing with this:

  1. If we can get names and contact info from the commit messages, then we can run the license clearing process we already have, maybe with some additional steps about that process.
  2. At the very least, we can collect statistics about how many contributors we are talking about.
  3. If we can't get names and contact info from the commit messages, then we need to get help to address what @xiaoxiang781216 is talking about, so NuttX can graduate from podling status. Surely other Apache projects have faced this same issue.

Let me know if you have other thoughts about this.

@PeterBee97 Will you clone the Bitbucket repo and look at the logs to see if you have some insight about it?

from nuttx.

patacongo avatar patacongo commented on August 19, 2024

This is also informative:

git log | grep "^Author"

That will produce over 30 thousand lines, but you can clearly see that the last several thousand commits have the author:

patacongo <patacongo@42af7a65-404d-4744-a932-0658087f49c3>

That, I think, is a bogus email that was created when the SVN repository was converted to GIT.

Then there are several thousand with the author:

Gregory Nutt <[email protected]>

That is GIT, but from when I was still using GIT as though it were SVN, with no authors.

The first author that is not me appears at:

commit b0507038494cd1ae9d14807db758d4e3ae98a1ef
Author: jeditekunum <[email protected]>
Date:   Sat Jan 24 14:31:35 2015 -0600

    First step at porting to MoteinoMEGA.  LED shows assert failure at boot.  Appears to be short double blink, short off (~1sec), followed by 250ms toggle cycles.  Most of it derived from amber board.

So it appears that there is no authorship information for the first 8 years, only for the last 5 years.
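The same inspection can be turned into a per-author tally. This is a minimal sketch, assuming the default `git log` output format shown above (one `Author: name <email>` line per commit); the function name is illustrative only.

```python
import re
from collections import Counter

# Matches the "Author: name <email>" line of default `git log` output.
AUTHOR_LINE = re.compile(r"^Author:\s*(.*?)\s*<([^>]*)>\s*$")


def count_authors(log_text):
    """Count commits per (name, email) pair found in `git log` output."""
    counts = Counter()
    for line in log_text.split("\n"):
        match = AUTHOR_LINE.match(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts
```

Feeding it the captured output of `git log` and calling `most_common()` on the result would show the SVN-era placeholder identity and the early GIT identity as the two largest entries, which makes the size of the unattributed history easy to see.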

from nuttx.

adamfeuer avatar adamfeuer commented on August 19, 2024

@patacongo @PeterBee97 If I do git log --reverse and search for ' by ', I find commits like this:

commit f03cb0ff3ababdcc84245d75d795ab956d110e09
Author: patacongo <patacongo@42af7a65-404d-4744-a932-0658087f49c3>
Date:   Tue Mar 16 00:53:32 2010 +0000

    Bugfixes submitted by David Hewson


    git-svn-id: svn://svn.code.sf.net/p/nuttx/code/trunk@2543 42af7a65-404d-4744-a932-0658087f49c3

There are others. They seem to indicate patches or other code from contributors, committed by Greg.

from nuttx.

adamfeuer avatar adamfeuer commented on August 19, 2024

@patacongo Thanks for pointing this out again, I am sorry I didn't remember this.

from nuttx.

patacongo avatar patacongo commented on August 19, 2024

> Bugfixes submitted by David Hewson

David Hewson I know. We are connected on LinkedIn. He just started working for HPE. He did some of the LPC31 port in the 2010 timeframe but has not been involved significantly since.

from nuttx.

patacongo avatar patacongo commented on August 19, 2024

> If I do git log --reverse and search for ' by ', I find commits like this

"by" or "from" would both be good search keys. I also recorded the authors in the old ChangeLog files that were recently removed from the repositories because they are not used in the current workflow. That should be a complete list of authors except for a few trivial things like typo fixes that weren't normally included in the ChangeLog.
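A minimal sketch of how those two search keys could be applied with a regular expression. The pattern is a heuristic assumption, not something taken from the project's tools: it only accepts two or more capitalized words after "by" or "from", so single-word handles are missed and vendor or product names can slip through and need manual filtering.

```python
import re

# Heuristic: a "name" is two or more capitalized words.
NAME = r"[A-Z][A-Za-z-]+(?:\s+[A-Z][A-Za-z-]+)+"
ATTRIBUTION_RE = re.compile(r"\b(?:by|from)\s+(" + NAME + r")")


def credited_names(message):
    """Return names that follow 'by' or 'from' in a commit or ChangeLog message."""
    return ATTRIBUTION_RE.findall(message)
```

Applied to the commit message quoted earlier, "Bugfixes submitted by David Hewson", this picks out the credited contributor, while "derived from amber board" yields nothing because the words after "from" are not capitalized.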

from nuttx.

PeterBee97 avatar PeterBee97 commented on August 19, 2024

> @PeterBee97 Will you clone the Bitbucket repo and look at the logs to see if you have some insight about it?

I cloned the Bitbucket repo today, but the git log seems to be the same as the one on GitHub...

So I found the latest ChangeLog from NuttX 9.0.0 RC0 and tried to extract the names using the keywords from|by, with the help of an NLP library, and put the results in names-changelog.txt. I also processed the git log in the same way; the result is names-gitlog.txt. Still, the commit messages of the earlier SVN commits are incomplete and many commits are authorless.

This may help cover some corner cases. Maybe we can open an issue and mention these users? But before that let's filter out the "safe" files first as @xiaoxiang781216 suggests.

from nuttx.

adamfeuer avatar adamfeuer commented on August 19, 2024

@PeterBee97 That's great! Fewer than 450 names in each file. The next steps are probably:

  • remove all the non-human names (Atmel, CONFIG_SDIO_PREFLIGHT, etc.)
  • remove all the names of committers (they have ICLAs) - I manually made a list of committers
  • remove duplicates (may need to be done manually since there are typos in the names)
  • merge the lists

Once this is done, it will give us a sense of how many people there are. Ideally we'd have a list of commits for each name, and only for the top N contributors... not sure what N should be, but looking at the data should tell us. Do you have an idea how to get a list of commits per name?
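The clean-up steps listed above could be sketched roughly as follows. Everything here is an assumption for illustration: the "looks like a person" rule (at least two alphabetic words) also drops single-word handles, and the similarity cutoff used to catch typos is arbitrary, so the output would still need a manual pass.

```python
import difflib
import re

WORD_RE = re.compile(r"[A-Za-z][A-Za-z.'-]*")


def looks_like_person(name):
    """Crude filter: two or more alphabetic words, not an ALL-CAPS identifier."""
    words = name.split()
    return (
        len(words) >= 2
        and all(WORD_RE.fullmatch(w) for w in words)
        and not name.isupper()
    )


def merge_name_lists(lists, committers, cutoff=0.9):
    """Merge name lists, dropping non-people, committers and near-duplicates."""
    committer_keys = {c.casefold() for c in committers}
    kept = []
    for name in sorted({n.strip() for names in lists for n in names}):
        if not looks_like_person(name) or name.casefold() in committer_keys:
            continue
        # Treat a close match to an already kept name as a typo of it.
        if difflib.get_close_matches(name, kept, n=1, cutoff=cutoff):
            continue
        kept.append(name)
    return kept
```

Because near-duplicates are resolved in alphabetical order, the spelling that is kept is not necessarily the correct one; the point is only to estimate how many distinct people remain.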

from nuttx.

patacongo avatar patacongo commented on August 19, 2024

> I cloned the Bitbucket repo today, but the git log seems to be the same as the one on GitHub...

Yes, the Bitbucket repositories are read-only mirrors of the incubator repositories.

from nuttx.

patacongo avatar patacongo commented on August 19, 2024
> I manually made a [list of committers](https://github.com/starcat-io/incubator-nuttx/blob/feature/license-clearing-tools/tools/license-clearing/committers.txt)

A large number of people do not use their names on PRs or commits, but rather some username/handle. A few of these I know. For example, v01d is Matias Nitshe, raidenpl is Mateusz Szafoni. Both Matias and Mateusz are Committers. But there are many more that I don't know.

from nuttx.

adamfeuer avatar adamfeuer commented on August 19, 2024

@patacongo Yes– we should find a way to update the committer list and the contributor list with handles... I'll think of some ways to do that...

from nuttx.

PeterBee97 avatar PeterBee97 commented on August 19, 2024

> @PeterBee97 That's great! Fewer than 450 names in each file. The next steps are probably:
>
>   • remove all the non-human names (Atmel, CONFIG_SDIO_PREFLIGHT, etc.)
>   • remove all the names of committers (they have ICLAs) - I manually made a list of committers
>   • remove duplicates (may need to be done manually since there are typos in the names)
>   • merge the lists
>
> Once this is done, it will give us a sense of how many people there are. Ideally we'd have a list of commits for each name, and only for the top N contributors... not sure what N should be, but looking at the data should tell us. Do you have an idea how to get a list of commits per name?

I used this script to get the list from git log and earlier commits by @patacongo :

git log --no-merges --author=patacongo --pretty=format:"%h %s" > gp.txt
cat ng2.txt | xargs -n 1 -I pp grep "pp" gp.txt > commits-patacongo.txt
./name-commits.sh ng2.txt name-commits.txt commits-patacongo.txt

Result (I haven't excluded the listed committers yet):
https://github.com/PeterBee97/authors-tool/blob/master/name-commits-full.txt
The names with no commits may be issue reporters' names, or names of committers who only contributed to the apps repo (I only ran the above commands in the nuttx repo). Also, some names are mentioned in the ChangeLog, but sadly there's no commit authored by or mentioning them.

from nuttx.

Apache9 avatar Apache9 commented on August 19, 2024

Any updates here? I think this is the only blocker preventing us from graduating, so let's try to make progress.

Thanks.

from nuttx.

adamfeuer avatar adamfeuer commented on August 19, 2024

@Apache9 No progress since the last update, I've been busy with other things. I'll merge @PeterBee97's code today. Next we should generate a list of people and the total lines of code for each person. Then we could sort in reverse order and decide how many people we need to try to contact.

@PeterBee97 Can you help with this? Can you find out how many lines of code were in each commit, tie them to a person in our list, create a list that combines all lines of code for each person, and create a CSV sorted in reverse order by total lines of code contributed?
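One possible shape for that report, as a sketch: it assumes the log was produced with `git log --no-merges --numstat --format=author:%an`, so that each commit contributes an `author:` line followed by `added<TAB>deleted<TAB>path` lines. It counts added lines per git author only; mapping SVN-era commits to the people credited in their messages would have to be layered on top. The function names and the `author:` prefix are illustrative choices.

```python
import csv
import re
from collections import Counter

# A --numstat line: "<added>\t<deleted>\t<path>"; binary files show "-".
NUMSTAT_RE = re.compile(r"^(\d+|-)\t(\d+|-)\t")


def lines_added_per_author(log_text):
    """Sum added lines per author from `git log --numstat --format=author:%an`."""
    totals = Counter()
    author = None
    for line in log_text.split("\n"):
        if line.startswith("author:"):
            author = line[len("author:"):].strip()
            continue
        match = NUMSTAT_RE.match(line)
        if match and author is not None and match.group(1) != "-":
            totals[author] += int(match.group(1))
    return totals


def write_report(totals, out):
    """Write a CSV sorted by lines added, largest first (ties broken by name)."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["author", "lines_added"])
    for author, added in sorted(totals.items(), key=lambda kv: (-kv[1], kv[0])):
        writer.writerow([author, added])
```

Added lines are a rough proxy for contribution size; files copied from elsewhere in the tree will inflate the count for whoever made the copy.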

from nuttx.

xiaoxiang781216 avatar xiaoxiang781216 commented on August 19, 2024

@adamfeuer how about we convert the source files that satisfy all of the following:
1. The first commit comes from git, not SVN or CVS
2. The copyright owner in the source code has already signed an SGA or ICLA
3. All contributors in the git log have already signed an SGA or ICLA

from nuttx.

adamfeuer avatar adamfeuer commented on August 19, 2024

@xiaoxiang781216 That would be a good first step for the conversion process. But as discussed on the mailing list, I thought we first wanted to do a rough total estimate of the entire project?

If we want to do both in parallel, then I think your idea will be a good start. We would need:

  • list of all contributors who have signed SGA or ICLA - right now we only have committers who I presume have signed ICLAs. I don't know how to get the complete list, do you?
  • list of all files for which
    • only ICLA committers are in the git log
    • first commit is not from svn or cvs
    • file's author headers match git author or author listed in git commit message
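These criteria can be expressed as a single predicate once the per-file data has been collected. The sketch below combines the list above with the earlier suggestion that the header's copyright owner must also have signed; the `FileHistory` record and its field names are hypothetical, and detecting SVN-era history (for example by a `git-svn-id:` trailer in a commit message) is assumed to happen when the record is built.

```python
from dataclasses import dataclass, field


@dataclass
class FileHistory:
    """Hypothetical per-file summary gathered from headers and git history."""
    path: str
    header_authors: set
    git_authors: set
    credited_in_messages: set = field(default_factory=set)
    has_svn_commits: bool = False


def safe_to_convert(history, signed):
    """True if the file meets every 'easy case' criterion.

    `signed` is the set of people known to have an ICLA or SGA on file.
    """
    if history.has_svn_commits:
        return False
    if not history.git_authors <= signed:
        return False
    if not history.header_authors <= signed:
        return False
    # Header authors must be accounted for by git authors or commit-message credits.
    return history.header_authors <= (history.git_authors | history.credited_in_messages)
```

Anything that fails the predicate is not necessarily a problem file; it just needs a human to look at it.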

from nuttx.

patacongo avatar patacongo commented on August 19, 2024

My recollection is not 100% clear, but I am recalling that @justinmclean mentioned in very early phases of this project that there were some legacy changes that could be just grandfathered in without following the full IP clearance process. I understood that this was necessary for other large, established Incubator projects as well.

If my understanding is correct, then I propose that we get permission to "cut some corners" on the pre-GIT changes that have no author associated with the individual commits. In most cases, the author of those early changes will be noted as an author or copyright holder in the BSD license header. In fact, I think that is true of all significant early code contributions. I would propose that we only use the GIT author changes for any automated analysis.

pre-GIT means pre-2014 so we are referring to very old changes.

Resolution of any remaining issues in the license headers will have to be a largely manual process anyway. We will have to examine each BSD license header and resolve all authors and copyright claims anyway. This should include all of the significant, pre-GIT changes. So I think with my suggestion here, the job can be made doable and there will be no loss of authorship on any significant contributions.

from nuttx.

patacongo avatar patacongo commented on August 19, 2024

> In most cases, the author of those early changes will be noted as an author or copyright holder in the BSD license header. In fact, I think that is true of all significant early code contributions.

I can think of one very frequent case where this is not true. In many cases, people clone files from one location to another. This is particularly true under arch/ and boards/. You will discover many files that I wrote, that have me as the copyright holder and author, but GIT will claim, incorrectly, that the person doing the PR/patch was the author. This will apply to several hundred files. There are cases where the info in the license header is more accurate than in the GIT log.

Third party code brought into the OS will have the same issue. The true author of the code is in the license header, not in the GIT log.

And there are places where people make mistakes in copying files without updating the license headers. For example, under net/ there are a few files that include some small bits of logic from Adam Dunkels. I see that those files with headers have been cloned numerous times and most are no longer correct. Adam Dunkels is not the author of any of the files under net/ (except perhaps some logic under net/sixlowpan and the TCP state machine and even those are very highly customized).

It is all very complex and we cannot expect to get it all 100% correct. I think we just have to keep a high level of integrity and do our best effort to discover and document all authorship.

I think the point is that GIT authors may not agree with the authorship in the license header and those will all need some clarification.

from nuttx.

adamfeuer avatar adamfeuer commented on August 19, 2024

@xiaoxiang781216 @patacongo I updated my comment above to include "file's author headers match git author or author listed in git commit message" – that handles the cases where things would match up easily.

Yes, there are a bunch of files that won't match up or are confusing... I think we just need to get a count of how many there are to see what it will take to track down the ones that matter.

from nuttx.

patacongo avatar patacongo commented on August 19, 2024

from nuttx.

justinmclean avatar justinmclean commented on August 19, 2024

from nuttx.

justinmclean avatar justinmclean commented on August 19, 2024

from nuttx.

xiaoxiang781216 avatar xiaoxiang781216 commented on August 19, 2024

> Hi
> > What confuses me though is that we're worrying about git authors whereas I believe that if someone contributes a file without listing themselves as the authors in the header (for the BSD case), didn't the author concede rights over the code by doing so?
>
> Without an ICLA (or an equivalent) this is not the case. Copyright automatically applies. They may not even own rights to the code they commit if their employment contract says otherwise.
> Thanks, Justin

So @justinmclean, is it safe for us to do the batch conversion if the source code meets all of the following criteria?
1. The source code wasn't converted from SVN or CVS
2. All committers (or their companies) in the git log have signed an ICLA or SGA
3. The copyright holder in the source code has signed an ICLA or SGA
I also have one question: do we need a contributor to sign an ICLA if he/she modified only a small portion of code (e.g. ~10 lines)? The threshold is also important for writing an automation tool.

from nuttx.

justinmclean avatar justinmclean commented on August 19, 2024

from nuttx.

patacongo avatar patacongo commented on August 19, 2024

Take care with this. The copyright holder in source may or may not be the correct one.

Similarly, the author in GIT may not be the author of the file. Often the copyright holder in the source file header is the correct one, even though that person may not appear in the GIT history.

Many people copy files that I wrote into different locations (very often for new architectures and for new boards which are very similar to older architectures and boards). Very often, I am the author of the file in these cases.

Bottom line: There is no magic, automated way to correctly determine the author. It requires collecting data and then also applying human insight.

@justinmclean In many cases there are multiple contributors of changes to a file. There is an original author, the original committer (who might be a different person), and people who have made trivial changes (as trivial as a spelling fix) or who have made substantial enhancements or re-designs. The former would not be treated as authors or copyright holders, but the latter may be. Is there any rule of thumb for what constitutes a significant change warranting rights to the file? Or does this also require human insight?

There are thousands of files involved here. This is potentially multiple man years of effort. I don't see how we can ever accomplish this.

from nuttx.

protobits avatar protobits commented on August 19, 2024

We can only operate on the information we have. If authorship information was lost from the CVS and SVN era (git author is Greg) and the header does not list anyone other than Greg, we can either "play safe" and leave the BSD header (we would be respecting the original author's license even if we don't know who it really was), or assume that without further information the original author cannot prove authorship either, in which case we are safe to change to Apache. For these "unknown" cases, I don't see any other way. We just need to decide and then act.

For other cases where there is indeed information I think we can script a header change based on various scenarios of git author/header author/author aliases where all have ICLAs. This change can be made to create one commit per file change and add the reason for the safety of the change to the commit message for traceability. Then, we can review each commit in a PR and decide if manual intervention is needed (throwing out unsafe changes, for example).

from nuttx.

patacongo avatar patacongo commented on August 19, 2024

> We can only operate on the information we have. If authorship information was lost from the CVS and SVN era (git author is Greg) and the header does not list anyone other than Greg, we can either "play safe" and leave the BSD header (we would be respecting the original author's license even if we don't know who it really was), or assume that without further information the original author cannot prove authorship either, in which case we are safe to change to Apache. For these "unknown" cases, I don't see any other way. We just need to decide and then act.

In the SVN/CVS days, I did always give credit to the contributor in comments. However, the task of reading all comments in those 15 thousand or so commits is a very onerous task. The information is there, just not easily accessible.

AFAIK there are no un-credited changes in the repositories.

from nuttx.

protobits avatar protobits commented on August 19, 2024

We can try to see what wording you used in general and use some regular expression to try to match the attribution.
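
A rough sketch of what such a match could look like. The attribution phrases below ("contributed by", "patch from", and so on) are assumptions for illustration only; the real wording would have to be taken from the actual commit log, and names that are not capitalized would be missed.

```python
import re

# Assumed attribution phrases; adjust after inspecting the real commit messages.
# The keywords are matched case-insensitively, the name must be capitalized.
ATTRIBUTION = re.compile(
    r"\b(?i:contributed|submitted|provided|patch)\s+(?i:by|from)\s+"
    r"(?P<name>[A-Z][\w'-]*(?:\s+[A-Z][\w'-]*)*)"
)


def attributed_names(message):
    """Return every name credited in one commit message, in order."""
    return [m.group("name") for m in ATTRIBUTION.finditer(message)]


if __name__ == "__main__":
    print(attributed_names("Fix typo in driver. Contributed by Alice Example."))
```

Commits where nothing matches would still need a human to read the message.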

What I'm thinking is that in any case we will always need to analyze a file by looking at its complete git history to extract git author + header author + commit message attribution, right? The "easy" cases would then be files only touched by current committers.

from nuttx.

justinmclean avatar justinmclean commented on August 19, 2024

from nuttx.

Apache9 avatar Apache9 commented on August 19, 2024

Let's clear the license for the files we own first. I think it is OK to have some files under compatible licenses in an ASF project. You just need to mention them in the NOTICE file. Another possible solution is to rewrite these files so we can change the license. Anyway, this depends on the number of files for which we cannot change the license.

Thanks.

from nuttx.

patacongo avatar patacongo commented on August 19, 2024

from nuttx.

Apache9 avatar Apache9 commented on August 19, 2024

I think @xiaoxiang781216 has already found someone who wishes to help here? But anyway, we need at least one committer to review the work...

from nuttx.

protobits avatar protobits commented on August 19, 2024

I've been writing some scripts which convert the output of git log (over a given file) into JSON format, to obtain metadata for each revision of the file. The final JSON contains (among other information): commit author, commit message and blob hash for the file.
I then started writing a Python script to parse the JSON and extract (using regular expressions) authors from the commit message and file header in each commit. It is working nicely so far.
The final goal would be to determine if a given file passes the previously discussed checks for the easy cases that can be moved to the Apache header. The Python script could also be used to make the header change and commit the result.
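
For illustration only, here is one way the git-log-to-JSON step could look in Python. This is a sketch and not the script described above; the field names and the separator choice are assumptions. It uses `git log --format` with `%x1f`/`%x1e` as field and record separators and `git rev-parse <commit>:<path>` for the blob hash, and it does not follow renames.

```python
import json
import subprocess

FIELD_SEP, RECORD_SEP = "\x1f", "\x1e"
# %H commit hash, %an author name, %ae author email, %B raw commit message
LOG_FORMAT = "%H%x1f%an%x1f%ae%x1f%B%x1e"


def parse_log(raw):
    """Turn the separator-delimited output of `git log` into a list of dicts."""
    records = []
    for chunk in raw.split(RECORD_SEP):
        chunk = chunk.strip("\n")  # only newlines: the separators count as whitespace
        if not chunk:
            continue
        commit, name, email, message = chunk.split(FIELD_SEP, 3)
        records.append({"commit": commit, "author": name,
                        "email": email, "message": message.strip()})
    return records


def file_history(path, repo="."):
    """Collect per-commit metadata for one file, including its blob hash."""
    log = subprocess.run(
        ["git", "-C", repo, "log", f"--format={LOG_FORMAT}", "--", path],
        check=True, capture_output=True, text=True)
    records = parse_log(log.stdout)
    for rec in records:
        blob = subprocess.run(
            ["git", "-C", repo, "rev-parse", f"{rec['commit']}:{path}"],
            capture_output=True, text=True)
        # The lookup fails when the commit deleted the file; record that as None.
        rec["blob"] = blob.stdout.strip() if blob.returncode == 0 else None
    return records


if __name__ == "__main__":
    print(json.dumps(file_history("README.md"), indent=2))
```

Keeping the parsing separate from the `git` invocation makes it possible to test the parser on canned log text.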

I will work a bit more on this and open a draft PR (to add the script inside tools/).

from nuttx.

patacongo avatar patacongo commented on August 19, 2024

I've been writing some scripts which convert the output of git log (over a given file) into JSON format, to obtain metadata for each revision of the file. The final JSON contains (among other information): commit author, commit message and blob hash for the file.

People have been using Fossology to get historical information: https://www.fossology.org/

from nuttx.

adamfeuer avatar adamfeuer commented on August 19, 2024

Yeah, life intervened and I haven't been able to get back to this. I have less time for it than I thought.

@PeterBee97 made some progress in parsing out the list of contributors from the Git log messages. I will see if I can take his list and get a list of files and the number of lines of code for each contribution... anyway, these seem to be the next steps:

  1. get a list of people who contributed
  2. get a list of the commits they were involved with
  3. work out how many lines of code per person are involved
  4. sort the list largest to smallest – this will give us an idea of how big the job is
  5. try contacting people with the n largest contributions

There are several other approaches. This is just the one that seems most straightforward to me. If anyone wants to help, we could use help with:

  1. writing a script that could take a list of commits and output the contribution size in lines
  2. getting a list of names and commits from the git log (Peter's scripts are this, or very close I think)
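
A minimal sketch of the line-counting step, assuming the input is the text produced by `git log --numstat --format=@%an` (an author line prefixed with `@`, followed by tab-separated `added`, `deleted`, `path` lines). Counting added plus deleted lines as "contribution size" is an assumption made here for illustration; binary files, which numstat reports as `-`, are skipped.

```python
from collections import Counter


def lines_per_author(numstat_log):
    """Sum lines touched per git author, sorted largest to smallest."""
    totals = Counter()
    author = None
    for line in numstat_log.splitlines():
        if line.startswith("@"):
            author = line[1:].strip()
            continue
        parts = line.split("\t")
        if author is None or len(parts) != 3:
            continue  # blank line or anything that is not a numstat row
        added, deleted, _path = parts
        if added.isdigit() and deleted.isdigit():  # binary files show "-"
            totals[author] += int(added) + int(deleted)
    return totals.most_common()
```

This only sees the git author of each commit, so contributors who are credited solely in commit messages would have to be merged in separately.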

from nuttx.

protobits avatar protobits commented on August 19, 2024

Please see #1834

I know @PeterBee97 started some of this work but to be honest it was quite difficult for me to take advantage of those, considering it was based on sqlite databases. I chose JSON format since it is quite easy to read and parse with different programming languages.

from nuttx.

patacongo avatar patacongo commented on August 19, 2024

Please see #1834

I know @PeterBee97 started some of this work but to be honest it was quite difficult for me to take advantage of those, considering it was based on sqlite databases. I chose JSON format since it is quite easy to read and parse with different programming languages.

I have to be in favor of anything that makes forward progress.

from nuttx.

adamfeuer avatar adamfeuer commented on August 19, 2024

@patacongo Re: anything that makes forward progress, me too.

@v01d yes, text-based json or csv/tsv formats would be great. The scripts in #1834 look cool. Maybe we combine them into one python script with the sh module. I'll try them out.

from nuttx.

protobits avatar protobits commented on August 19, 2024

@v01d yes, text-based json or csv/tsv formats would be great. The scripts in #1834 look cool. Maybe we combine them into one python script with the sh module. I'll try them out.

There's quite a bit of escaping going on in the bash script, so embedding it inside python would probably require some work. Not sure if it is worth it, but we can think about it.

from nuttx.

protobits avatar protobits commented on August 19, 2024

Comment moved to #1834

from nuttx.

protobits avatar protobits commented on August 19, 2024

Comment moved to #1834

from nuttx.

protobits avatar protobits commented on August 19, 2024

Oops, thought I was on the PR, I'll move the comments there

from nuttx.

yy-gu avatar yy-gu commented on August 19, 2024

@justinmclean @adamfeuer

Hi guys, we made some progress and posted it here:
#1954

Basically, we collected the list of authors and companies that have not signed the agreement. So the next step is to contact them via email and get them to sign the agreement.

My questions are the following:

  1. Is there an email template for contacting the authors?
  2. Where do we return the signed ICLA to? Is there somebody from Apache Foundation to collect and verify them?

from nuttx.

justinmclean avatar justinmclean commented on August 19, 2024

ICLAs are emailed to [email protected]; see https://www.apache.org/licenses/contributor-agreements.html

from nuttx.

yy-gu avatar yy-gu commented on August 19, 2024

@justinmclean Thanks! One more question: how would you normally contact companies to get their SGA signed? Do you contact people you know at the company to get introduced? What department is normally responsible for this?

For other authors, shall we just send automated emails to contact them?

from nuttx.

yy-gu avatar yy-gu commented on August 19, 2024

@justinmclean One more question: shall we ask authors to send their ICLAs directly to [email protected]? Will someone from the Apache Secretary's office process the mails, update the list, and sync the author list with us?

from nuttx.

patacongo avatar patacongo commented on August 19, 2024

I think this issue can be closed:

  • It is inactive. There have been no comments since 2020.
  • NuttX has since graduated to a TLP so all IP clearance issues must have been resolved.

If there is something I am missing please just re-open.

from nuttx.
