anttsou / qmj



qmj's Issues

Library qmjdata does not automatically load when loading qmj

Loading qmj does not automatically load qmjdata, which prevents me from running:
library(qmj)
data(companies) # Or equivalent

I will look into getting both to load when either one is loaded. Failing that, we should explicitly inform the user that the data lives in a separate package, "qmjdata".
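A minimal sketch of the second option, i.e. telling the user where the data lives. The helper name load_companion_data is hypothetical and not part of qmj; it only relies on base R's requireNamespace() and utils::data().

```r
# Hypothetical helper (not part of qmj): load a data set from a companion
# data package, failing with a clear message if that package is missing.
load_companion_data <- function(name, package = "qmjdata") {
  if (!requireNamespace(package, quietly = TRUE)) {
    stop("Data set '", name, "' lives in the separate package '", package,
         "', which is not installed.", call. = FALSE)
  }
  env <- new.env()
  # Load into a private environment so the global workspace is untouched.
  suppressWarnings(utils::data(list = name, package = package, envir = env))
  if (!exists(name, envir = env, inherits = FALSE)) {
    stop("Package '", package, "' has no data set named '", name, "'.",
         call. = FALSE)
  }
  get(name, envir = env)
}

# Usage, assuming qmjdata is installed:
# companies <- load_companion_data("companies")
```

The same check could also run from a package's .onAttach hook, but that is a design choice for the maintainers.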

Documentation for get_companies is unintuitive in explaining how it works.

The documentation implies the function automagically retrieves the data, formats it as a text file, and then reads it in. I intend to edit this to make the requisite steps clearer.

Working on this now. Will close when I submit the pull request.

?qmj leads to "No documentation for 'qmj' in specified packages and libraries."

??qmj returns the error:

Error in vignette_type(Outfile) : Vignette product ‘NA’ does not have a known filename extension (‘NA’)

Phantom Bugs #17 and #18

Issue #18 and Issue #17 bother me. Our data should be recent enough that I shouldn't be seeing these discrepancies between my new data and the old data.

Will think further on this, and will resolve both issues once I've come to terms with their conceptual existence.

get_prices - quantmod getSymbols function changed

It seems that the getSymbols function in quantmod has changed, leading to this bug:

When this line of code is run:

stockData <- quantmod::getSymbols("^GSPC", src="yahoo", auto.assign=FALSE)

Providing a function to clean temporary data

Specifically, this concern exists:

If I start get_prices, stop halfway through, and then resume get_prices a week later, get_prices will find the old temporary data and then resume the download. However, the range of stock prices will now differ between the data sets, corrupting our results to some degree.

I am writing a function that, when the user calls it, removes all qmj temporary data from the temp directory.
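A rough sketch of what such a cleaner could look like. The function name and the "qmj_" file-name prefix are assumptions made for illustration; the actual temporary files written by get_prices may be named differently.

```r
# Hypothetical cleaner: delete leftover temporary files whose names match
# a pattern. The "qmj_" prefix is an assumed naming scheme, not qmj's own.
clean_qmj_temp <- function(dir = tempdir(), pattern = "^qmj_.*\\.RData$") {
  stale <- list.files(dir, pattern = pattern, full.names = TRUE)
  removed <- file.remove(stale)
  # Return the paths that were actually deleted, invisibly.
  invisible(stale[removed])
}

# Usage: call before restarting an interrupted download.
# clean_qmj_temp()
```

Matching on a pattern rather than wiping the whole directory keeps unrelated files in tempdir() safe.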

Issue: Financials Data Table Overfilled with Erroneous Data, Missing Some Gathered Data

Related to Issue #18 (as AAC isn't in our financials table at all) and Issue #24

get_info or tidyinfo is handling some data badly. Possibly incorrectly inserting anomalous data.

Ex: In our financials data frame:

However, looking directly at the quantmod data, we have:

CASHFLOWS:

BALANCESHEETS:

INCOME STATEMENTS:

In other words, either get_info or tidyinfo is mishandling the information it receives and storing data incorrectly. Will look into this further.

Off-hand Thought: Dealing With Missing Information

When a calculation returns NA or Inf (almost always the result of missing information), we currently cope by scaling the results and then setting certain values to 0, mostly in cases where a 0 would have no effect on the resulting quality or quality component score.

Option A to Maximize Number of Quality Scores Produced: Produce component sub-scores Where Possible, Ignore Missing sub-scores

I'm concerned that companies with either questionable filings or missing data are not penalized, and the scaled values provide z-scores for companies which provide a maximal amount of data, a generally desirable attribute.

As a contrived example, Rigged Company A can produce documents which lead to an enormous growth score while, through creative accounting, giving us just enough information in the other categories to be assigned a neutral score.

Rigged Company A is thus judged as high quality, and a brief overview of the quality data set does not clearly reveal the lack of information.

On top of that, high-accuracy companies are "penalized" as their component sub-scores are judged only relative to other high-accuracy companies. The under-performers are removed, so their z-scores are absolutely lower to reflect that.

Option B to Maximize Accuracy: When a sub-value is missing, simply fail to produce the component score, and thus the quality score.

I'll have to double check what we do with specific sub-values, but following this method costs us roughly 400 companies.
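To make the trade-off concrete, here is a toy comparison of the two options. The numbers are made up, and combining components with a plain row mean is an assumption for illustration, not necessarily how qmj builds its quality score.

```r
# Toy component z-scores; NA marks a component that could not be computed.
components <- data.frame(
  profitability = c(0.5, NA, -1.0, 1.2),
  growth        = c(1.0, 2.5, NA, -0.3),
  safety        = c(-0.2, NA, 0.4, 0.1),
  payouts       = c(0.3, NA, 0.9, -0.8),
  row.names     = c("A", "RIGGED", "C", "D")
)

# Option A: average whatever components exist, ignoring the missing ones.
quality_option_a <- function(x) rowMeans(x, na.rm = TRUE)

# Option B: produce no quality score unless every component is present.
quality_option_b <- function(x) rowMeans(x, na.rm = FALSE)

a <- quality_option_a(components)
b <- quality_option_b(components)
print(cbind(option_a = a, option_b = b))
```

In this toy data the company that reports only growth ranks first under Option A and gets no score at all under Option B, which is the concern described above.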

qmj package documentation file is out of date

If this were 1984, it would refer to un-datasets.

Do we want to set up a (relatively painless) way of updating/retrieving the companies from the Russell 3000 Index?

If memory serves me correctly, parsing the original data directly/programmatically is extremely painful, given that the Russell 3000 company list is published as a .pdf.

Given that the list only updates once a year, and that it's (relatively) simple to create a data frame like what we expect as input, I also wouldn't say this is crucial, but it's food for thought.

By the way, to get the original list that we're currently using, I copied and pasted the text into Notepad, removed some unwanted artifacts by hand (typically something on the order of a "page end" marker), and then parsed the result with a line or two of R. In practice, the user would likely need to copy the text into an appropriate file and then call our function to produce the data frame.
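For reference, the "line or two of R" could look roughly like the sketch below. It assumes each company line has the form "COMPANY NAME TICKER" in capitals; the function name and the name/ticker column names are hypothetical and may not match what get_companies expects.

```r
# Hypothetical parser: keep only fully capitalized "NAME TICKER" lines and
# split the last word off as the ticker.
parse_component_lines <- function(lines) {
  lines <- trimws(lines)
  keep <- grepl("^[A-Z0-9 .&'/-]+ [A-Z.]+$", lines)
  lines <- lines[keep]
  data.frame(
    name   = sub(" [A-Z.]+$", "", lines),  # everything before the last word
    ticker = sub("^.* ", "", lines),       # the last word
    stringsAsFactors = FALSE
  )
}

# Usage, assuming the pasted text was saved to a file:
# companies <- parse_component_lines(readLines("russell3000.txt"))
```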

get_companies() regex cuts out several companies when directly copying and pasting from the Component List

Specifically, when this issue pops up: TRIPADVISOR INC TRIPAs of 06/26/2015 Russell Indexes.

We're parsing for lines that are entirely capitalized, so TRIP isn't read in as a company. Should be easy to fix: just drop the rest of the line once we read an "As of".
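A minimal sketch of that fix, assuming the trailing text always starts with "As of" followed by a date in MM/DD/YYYY form. The function name strip_as_of is made up for this example.

```r
# Hypothetical pre-processing step: cut everything from "As of <date>" to
# the end of the line, so the company line is fully capitalized again.
strip_as_of <- function(lines) {
  trimws(sub("As of [0-9]{2}/[0-9]{2}/[0-9]{4}.*$", "", lines))
}

cleaned <- strip_as_of("TRIPADVISOR INC TRIPAs of 06/26/2015 Russell Indexes.")
print(cleaned)
```

Lines without an "As of" marker pass through unchanged, so this can run on every line before the capitalization check.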

Observation: get_prices is slow to aggregate the various chunks of raw price data into a single data object

Slow enough that I wondered if R had either crashed or frozen.

A quick stop-gap measure to ensure the user's aware that work is happening is to set up a notifier telling the user when so-and-so is processed.
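One way such a notifier might look, as a sketch only. It assumes the raw price data is held as a named list of data frames (one per ticker); the function name, the notify_every argument, and the ticker column are all hypothetical.

```r
# Hypothetical aggregation step with progress messages. `chunks` is assumed
# to be a named list of data frames, one per ticker.
combine_chunks <- function(chunks, notify_every = 50) {
  n <- length(chunks)
  if (n == 0) return(data.frame())
  tickers <- names(chunks)
  for (i in seq_len(n)) {
    # Tag each chunk with its ticker before binding.
    chunks[[i]]$ticker <- tickers[i]
    if (i %% notify_every == 0 || i == n) {
      message(sprintf("Processed %d of %d price chunks (latest: %s)",
                      i, n, tickers[i]))
    }
  }
  # Bind once at the end rather than growing a data frame inside the loop.
  combined <- do.call(rbind, chunks)
  rownames(combined) <- NULL
  combined
}
```

Using message() rather than print() lets the user silence the output with suppressMessages() if they want to.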

Speeding up the process should occur later down the line when more serious issues have been dealt with.

Worth Repeating explanation of Russell 3000 in Prices and Financials data documentation?

This is from an earlier push to the qmjdata repo, but one of the larger changes I implemented was to replace the detailed descriptions of the Russell 3000 Index in the prices and financials data sets with the single sentence "For more information on the Russell 3000 Index and why it was chosen, please see {link to companies data set}"

I don't see the repeated description of the Russell 3000 Index as critical to understanding the Prices/Financials data sets, aside from mentioning that it's the source for the chosen companies.

However, I'm having second thoughts about making an individual go to another "page" if they want to find out more about the Index.

Your thoughts?

Impose consistency across variable names and function names

A non-small chunk of this is my fault.

There are variables in lower camel case, e.g., startDate
And there's also a large amount of code that uses underscore separation, e.g., start_date

Once the bigger things are taken care of, I'll go back and rewrite variables, etc., to follow one style strictly, likely underscore separation.
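As a small aid for that cleanup, a sketch of a helper that maps camel-case names to underscore separation. The function name is hypothetical, and it only handles simple cases like the ones mentioned above.

```r
# Hypothetical helper: insert an underscore before each capital letter that
# follows a lowercase letter or digit, then lowercase everything.
to_snake_case <- function(x) {
  tolower(gsub("([a-z0-9])([A-Z])", "\\1_\\2", x))
}

renamed <- to_snake_case(c("startDate", "start_date", "numCompanies"))
print(renamed)
```

Names that already use underscores are left as they are, so the helper can be applied to a whole vector of names at once, e.g. names(df) <- to_snake_case(names(df)).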

If statement in market_data

Specifically this statement:

if(length(which(financials$TCSO < 0))) {
  stop("Negative TCSO exists.")
}

Was in the midst of updating documentation, and I'm not quite sure I understand the stop message, or why TCSO is specifically chosen. Can someone explain it to me?
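This does not answer why TCSO was chosen, but as a note on what the condition does mechanically: which() drops NA comparisons, and if() treats any non-zero count as TRUE. A small sketch with made-up values:

```r
# Made-up values standing in for financials$TCSO.
tcso <- c(100, NA, 250, -5)

# Counts entries that are definitely negative; NA comparisons are dropped.
n_negative <- length(which(tcso < 0))

# An equivalent logical test that also ignores NA values.
has_negative <- any(tcso < 0, na.rm = TRUE)

print(c(n_negative = n_negative, has_negative = has_negative))
```

So the stop() fires only when at least one non-missing TCSO value is below zero; missing values alone never trigger it.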

For Case Study: Reducing the Number of Companies for which we produce No Quality Score

Here's a small snippet of companies for which we produce no quality scores. The current number is 385. (I've yet to seriously tamper with how we handle missing data, which may reduce the list somewhat)

Taking AAC as a sample, with a side-by-side comparison of a company we do produce a quality score for (AMH), the main issue appears to be the absence of key figures.

INCOME STATEMENT:

BALANCE SHEET:

CASH FLOW:

The main issue appears to be the absence of a couple of key figures. Growth, Payouts, and Profitability all use gross profits in their calculations, for example. Since quantmod currently only allows us to source our data from Google, it'll be worth looking into what figures are "reasonably implied" when not explicitly given. (Yahoo Finance, for example, does give gross profits of AAC as equal to total revenue.)
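A sketch of what "reasonably implied" could mean for gross profits, following the Yahoo Finance example above. The column names gross_profit and total_revenue are hypothetical, and whether this substitution is financially sound for every company is an open question, so the filled rows are flagged.

```r
# Hypothetical imputation: where gross profit is missing, fall back to total
# revenue and record that the value was implied rather than reported.
fill_gross_profit <- function(financials) {
  missing_gp <- is.na(financials$gross_profit)
  financials$gross_profit_implied <- missing_gp
  financials$gross_profit[missing_gp] <- financials$total_revenue[missing_gp]
  financials
}

# Toy data with made-up numbers.
toy <- data.frame(
  ticker        = c("AAC", "AMH"),
  total_revenue = c(130, 400),
  gross_profit  = c(NA, 220)
)
filled <- fill_gross_profit(toy)
print(filled)
```

Keeping the flag column makes it possible to see later how many quality scores depend on an implied figure.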

Safety I'm unsure of. I'll be looking at this over the next few days as reference in order to try to reduce the number of non-quantified companies.

The back of my head tells me that the code should be robust enough that missing a single key figure here and there should still be able to produce a rough score, so something else may be up. I'll comment here and (ideally) resolve the issue once I address any/all reasonable means of producing a quality score for AAC.

Missing data?

Hi - this is a really useful package for exploration, thanks for making it!

However, I've been going through the readme and noticed that:

#And more detailed data sets into what makes up quality
data(profitability)
data(growth)
data(payouts)
data(safety)

does not seem to work. Is this data available?

Thanks,
Alex
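For anyone hitting the same problem, one way to check which data sets an installed package actually ships is utils::data(package = ...). The wrapper name below is hypothetical, and the example uses the built-in datasets package so it runs without qmj or qmjdata installed.

```r
# Hypothetical wrapper: list the data set names an installed package provides.
list_package_data <- function(package) {
  items <- utils::data(package = package)$results[, "Item"]
  # Some entries look like "name (file)"; keep only the data set name.
  unname(sub(" .*$", "", items))
}

available <- list_package_data("datasets")
print(head(available))

# With the packages installed, the same call would work for them:
# list_package_data("qmjdata")
```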

Cleaning up tidy_prices

qmjdata is not available

In RStudio, running under Ubuntu, I entered the commands:
library(devtools)
install_github("anttsou/qmj")

The response to the latter command was:
ERROR: dependency ‘qmjdata’ is not available for package ‘qmj’
Installation failed: Command failed (1)

What is the correct procedure for installing qmj?
-John

README markdown file for github repo is badly, badly out of date

Some data sets no longer exist. Have not yet checked function statuses.