The otvoreni-akti from za-grad

Search index to be automatically rebuilt whenever documents.py is changed

Problem:
The Heroku app recently showed Server 500 errors upon any search, because I changed documents.py during the hackathon (this file decides how to index the acts and is closely linked to elasticsearch). Every time this file is updated, we should rebuild the index by running python manage.py search_index --rebuild.

Solution:
The search_index must be somehow rebuilt, either on a regular basis or upon every deploy. Exact solution to be evaluated.
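
One way to cover the "upon every deploy" option is a small script that the deploy step calls. This is only a sketch: the function names are made up for illustration, and it assumes the `-f` flag of `search_index` skips the interactive confirmation, which should be checked against the installed django-elasticsearch-dsl version.

```python
import subprocess
import sys


def rebuild_command(manage_py="manage.py"):
    """Build the argument list that rebuilds the Elasticsearch index."""
    # -f is assumed to skip the "are you sure?" prompt so it can run unattended.
    return [sys.executable, manage_py, "search_index", "--rebuild", "-f"]


def rebuild_index(manage_py="manage.py"):
    """Run the rebuild; raises CalledProcessError if the command fails."""
    subprocess.run(rebuild_command(manage_py), check=True)
```

Calling `rebuild_index()` from the deploy hook would fail the deploy loudly if the rebuild fails, rather than leaving a stale index behind.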

The search for common terms like "Zagreb" and "točka" is very slow

Problem:
The search for common terms like "Zagreb" and "točka" is very slow.

Possible solutions:

Use Elasticsearch's built-in pagination/scan for the results instead of relying on Django template pagination.
Reduce the number of maximum results to 1000.
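
A rough sketch of how both ideas could combine: compute the Elasticsearch `from`/`size` window for the requested page and never go past the 1000-result cap. The function name and the default page size of 20 are assumptions, not something already in the codebase.

```python
MAX_RESULTS = 1000  # cap proposed above


def page_window(page, page_size=20, max_results=MAX_RESULTS):
    """Return the Elasticsearch from/size values for a 1-based page number."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    start = (page - 1) * page_size
    if start >= max_results:
        # Past the cap: ask Elasticsearch for nothing.
        return {"from": max_results, "size": 0}
    # Trim the last page so we never fetch beyond the cap.
    return {"from": start, "size": min(page_size, max_results - start)}
```

With this, only one page of hits is fetched per request instead of loading the full result set and paginating it in the template.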

Add tests to website

Obey the testing goat...BAAAAAAAAAAAAAA

Funny video of Harry demoing TDD for Django:
https://www.youtube.com/watch?v=X9474CgJleg

Harry's book:
https://www.obeythetestinggoat.com/pages/book.html

Add celery tasks for automated scraping

Problem:

Currently, the scraper needs to be started manually from the Django shell.

Solution:

Add two Celery tasks:
1. A task that rescrapes the last 10 periods of acts every night at 3 am, Croatian time.
2. A task that does a full scrape every night at 4 am, Croatian time.
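
The schedule could look roughly like this in settings.py. This is a sketch under assumptions: the task paths and the `periods` argument are hypothetical, and the `CELERY_` prefix only works if the Celery app loads Django settings with `namespace="CELERY"`.

```python
from celery.schedules import crontab

# Interpret the crontab hours below in Croatian local time.
CELERY_TIMEZONE = "Europe/Zagreb"

CELERY_BEAT_SCHEDULE = {
    "rescrape-last-10-periods": {
        # Hypothetical task path; the real task still has to be written.
        "task": "skupstina.tasks.rescrape_recent_periods",
        "schedule": crontab(hour=3, minute=0),  # every night at 03:00
        "kwargs": {"periods": 10},
    },
    "full-scrape": {
        # Hypothetical task path as well.
        "task": "skupstina.tasks.full_scrape",
        "schedule": crontab(hour=4, minute=0),  # every night at 04:00
    },
}
```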

Add Croatian language support (switch between Croatian and English)

Have an option to show archived search results

Problem:

Otvoreni-akti relies on the Grad Zagreb website being online and reachable to show search results. If the mayor decides to take down the website, there should be an archive of the existing acts.

Possible solutions:

Provide a link next to the app to show an archived version of the act (plain-text).
Add a TextField to the Act model to save the full HTML of the act (this solution is nicer but not preferable, as it will require a complete rescrape).

Other thoughts:
Not sure if this will be a credible source of information because otvoreni-akti database is managed by a 3rd party (us) and there is no guarantee that this data has not been tampered with.

Search ordering by relevance / date etc

Problem:
There are no options for sorting of search results.

Solution:

Add options to sort the results by date (ascending and descending)
Add option to sort the results by relevance (this is the current default)
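
A minimal sketch of mapping the user's choice to an Elasticsearch `sort` clause. The option names and the `date` field name are assumptions and should be matched to whatever documents.py actually indexes.

```python
# Maps a sort option from the search form to an Elasticsearch sort clause.
SORT_OPTIONS = {
    "relevance": ["_score"],
    "date_asc": [{"date": {"order": "asc"}}],
    "date_desc": [{"date": {"order": "desc"}}],
}


def build_sort(option="relevance"):
    """Return the sort clause for an option, falling back to relevance."""
    return SORT_OPTIONS.get(option, SORT_OPTIONS["relevance"])
```

Falling back to relevance for unknown values keeps the current default behaviour if someone edits the query string by hand.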

Remove "Fork me on GitHub" from site header

As stated in the title, it just gets in the way on mobile specifically.

Do away with .txt files for storing data and create DB models for them

simplify the folder structure

Let's try and create a simpler folder structure based on something like this:

https://github.com/metakermit/hellodjangorest

Also, I think we could rename the project itself from skupstina to otvoreni-akti. Then we could keep the name skupstina just for the Django app. That way in the future in theory we could have another Django app that deals with some other open data (just hypothetically, not saying we really will do this), like I don't know the data about companies in Zagreb and their public finance.

We could then have:

otvoreni-akti/ <- the root git repository

.gitignore
.env
manage.py
otvoreni-akti/ <- the Django root package
- __init__.py
- settings.py
- wsgi.py
- apps/
  - skupstina/
    - views.py
    - models.py

Improve Django default admin CMS to show more useful data

Do not use eval statement in search app
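
If the eval call is only there to turn a string into a Python literal (a list, dict, number and so on), which is an assumption since the issue does not say, `ast.literal_eval` from the standard library is a possible drop-in. The helper name below is made up for illustration.

```python
import ast


def parse_literal(text, default=None):
    """Parse a Python literal from text without executing any code."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        # Anything that is not a plain literal (e.g. a function call) is rejected.
        return default
```

Unlike eval, this refuses expressions such as function calls, so user input cannot run arbitrary code through it.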

Add visual indicator in search results for attachments

Repair styling issues

Issues as listed by @ranajaydas:

It’d look nicer if the edges aligned with the main search bar (the period one is a bit to the left)
In mobile view, the boxes shrink a bit so the height isn’t the same as the search box so it looks a bit strange
We should add outline: none; to all the search boxes to remove the outlines that occur on selection
The additional search info drop down would look better if it’s inside the purple header to maintain visual consistency

Improved document classification and filtering by document type

1)

We have several types of Acts:

i) Zaključak (en = Determination or Conclusion)
ii) Obrazloženje (en = Deliberation or Explanation)

Do we already have tags as such in the database?

Each point (topic) on the list of acts for a given period can have multiple attachments of type (i) and/or (ii). Usually an (i) is followed by a (ii), but an (i) can also be the only attachment on a point.

Example:

Topic: SOME TITLE
Acts: (list of attachments)

Zaključak
Obrazloženje
Zaključak
Obrazloženje

More info:

https://legal-dictionary.thefreedictionary.com/acts

act
1 the formally codified result of deliberation by a legislative body; a law, edict, decree, statute, etc. See ACT OF PARLIAMENT.
2 a formal written record of transactions, proceedings, etc., as of a society, committee or legislative body. --> (2) is definition where Mayor’s acts fit into.

https://legal-dictionary.thefreedictionary.com/determination

Can we show a label / tag next to each title to show the type of document in which it was found?

2)

Can we add an advanced filter (in the GUI, or via a keyword in advanced search) to restrict the search to specific file types? For example, Google search supports this

(https://support.google.com/webmasters/answer/35287?hl=en): filetype:doc
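
A possible starting point for the keyword variant: pull a `filetype:` token out of the query string before it goes to the search backend. The function name is hypothetical, and how the file type is then applied as a filter depends on what the index stores for each attachment.

```python
import re

# Matches a Google-style "filetype:doc" token anywhere in the query.
FILETYPE_RE = re.compile(r"\bfiletype:(\w+)", re.IGNORECASE)


def split_filetype(query):
    """Return (remaining search terms, file type or None)."""
    match = FILETYPE_RE.search(query)
    if match is None:
        return " ".join(query.split()), None
    file_type = match.group(1).lower()
    # Drop the token and collapse the leftover whitespace.
    remaining = " ".join(FILETYPE_RE.sub("", query, count=1).split())
    return remaining, file_type
```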

3)

As shown above, a single point on the list of acts can have multiple documents. Can we search by keyword and optionally expand the results to all documents linked to the matching point? For example, searching “property” would list every document attached to a point where any of its attachments contains that keyword. It could be an optional “advanced” search parameter. The reasoning is that even though the keyword is found in one specific document, the whole story is told by all the attachments together. This is just a thought and may need further discussion with the key users, but I thought it could make sense.
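
To make the idea in 3) concrete, here is a toy sketch on plain dictionaries. The `point_id` and `text` keys are assumptions about how documents relate to points; in the real app this would be a database or Elasticsearch query rather than an in-memory loop.

```python
def expand_to_points(documents, keyword):
    """Return every document whose point has at least one attachment matching keyword."""
    needle = keyword.lower()
    # First pass: find the points where any attachment mentions the keyword.
    matching_points = {
        doc["point_id"] for doc in documents if needle in doc["text"].lower()
    }
    # Second pass: return all attachments of those points, matching or not.
    return [doc for doc in documents if doc["point_id"] in matching_points]
```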

Refactor and code cleanup (modify this to add suggestions)

Example:

Remove duplication of base_url variable in various files

Fix the mobile view of the site:

Results get cut off
Dates once entered cannot be cleared

Changed ALLOWED_HOSTS

ALLOWED_HOSTS should not be ['*'] once the app is deployed to production
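
One common approach is to read the hosts from an environment variable in settings.py. A minimal sketch, where the variable name `DJANGO_ALLOWED_HOSTS` and the local defaults are assumptions:

```python
import os


def parse_allowed_hosts(raw):
    """Split a comma-separated host list, ignoring blanks and stray spaces."""
    return [host.strip() for host in raw.split(",") if host.strip()]


# DJANGO_ALLOWED_HOSTS is a hypothetical variable name; set it on the host.
ALLOWED_HOSTS = parse_allowed_hosts(
    os.environ.get("DJANGO_ALLOWED_HOSTS", "localhost,127.0.0.1")
)
```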

100% scrape and fixing of bugs

Refactoring of scrape engine to allow for full scrapes and partial scrapes

Ability to do partial scrapes will be useful when automating routine DB updates using celery

Add a description of the Otvoreni-akti website to the header/about section

Problem:
There is no explanation for what the app does.
"Add "user manual" - maybe some examples and general info (what are these acts and what kind of info do they contain) - so if person comes for the first time they can get idea what can they find/search for"

Solution:
Add a description to the header or create an about section.

Create custom context manager for requests (to handle max retries etc)

We are using requests in a lot of places and this is prone to failure. Instead of having setup and teardown code for checking maximum retries, I suggest adding a context manager for it.

For example, replace:

parse_complete = False
max_retries = 10
sleep_time = 10
print('Parsing ', subject['subject_url'])
while not parse_complete and max_retries > 0:
    try:
        subject_details = parse_subject_details(subject['subject_url'])
        parse_complete = True
    except exceptions.ConnectionError as e:
        parse_complete = False
        max_retries -= 1
        print('Connection Error while parsing {}:\n{}\n'.format(subject['subject_url'], e))
        print('Retrying...\n')
        time.sleep(sleep_time)
if max_retries == 0:
    print('Maximum retries exceeded. Please run the scraper again.\n')
    raise exceptions.ConnectionError

with something like:

with request_with_retry(url):
    do_something
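
One caveat: the body of a `with` block runs only once, so a context manager cannot re-run it on failure. The sketch below therefore uses a plain helper function instead of a context manager. The name `call_with_retry` and its parameters are made up for illustration, and the built-in `ConnectionError` is only the default; for requests, pass its own exception class through `retry_on`.

```python
import time


def call_with_retry(func, *args, max_retries=10, sleep_time=10,
                    retry_on=(ConnectionError,), **kwargs):
    """Call func(*args, **kwargs), retrying when one of retry_on is raised."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    for attempt in range(1, max_retries + 1):
        try:
            return func(*args, **kwargs)
        except retry_on as error:
            print('Attempt {} of {} failed: {}'.format(attempt, max_retries, error))
            if attempt == max_retries:
                print('Maximum retries exceeded. Please run the scraper again.')
                raise  # re-raise the last error for the caller
            print('Retrying...')
            time.sleep(sleep_time)
```

The loop above would then shrink to something like `subject_details = call_with_retry(parse_subject_details, subject['subject_url'], retry_on=(exceptions.ConnectionError,))`.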

Add Pytest to CI/CD for Heroku to run a test before accepting any changes

Problem:
The Heroku app recently showed Server 500 errors upon any search, because I changed documents.py during the hackathon (this file decides how to index the acts and is closely linked to elasticsearch). Every time this file is updated, we should rebuild the index by running python manage.py search_index --rebuild.

Solution:
Upon running pytest manually on Heroku, I was able to trace the error because two of the view tests failed. Pytest should therefore be a part of our CI/CD pipeline.

Automatic update of list of periods

Get user feedback

Add filtering for attachments

Standardize URLs to lowercase and date format to prevent duplicate DB entries
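
A small sketch of what the normalization could look like before saving. The helper names and the list of accepted date formats are assumptions and need to be matched to what the scraper actually encounters.

```python
from datetime import datetime

# Assumed input formats; extend with whatever the scraped pages really use.
DATE_FORMATS = ("%d.%m.%Y.", "%d.%m.%Y", "%Y-%m-%d")


def normalize_url(url):
    """Trim whitespace and lowercase the URL so duplicates compare equal."""
    return url.strip().lower()


def normalize_date(text):
    """Convert a date string in any known format to ISO format (YYYY-MM-DD)."""
    cleaned = text.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue  # try the next format
    raise ValueError("Unrecognized date format: {!r}".format(text))
```

Lowercasing the whole URL is only safe if the source site treats paths and query strings case-insensitively, so that is worth verifying first.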

Set up hosting

Deploy the app online

Heroku basic working
Dokku basic working
static files using whitenoise (see https://github.com/metakermit/hellodjangorest)
database
celery
run scraper as a periodic celery task

Remove 'catch all' exceptions from scraper

Improve date filter on search page

Problem:

The date picker changes from browser to browser and on Chrome it shows dd-----yyyy
The date picker cannot be cleared once a date is entered on iOS.

Solution:

Use a custom CSS/JS date picker

Parsing of PDF attachments

Replace Bootstrap CSS with custom styles

Set up the link to Sentry correctly. Currently it is not receiving any events.

Parsing of MS Doc(X) files

Advanced search & filtering

Would be cool to be able to search not just by terms, but also by filtering from and to dates.

Maybe something like:

Analyze keywords to match inflected word forms in Croatian (the snowball filter currently only handles English words)

Problem:
The current search assumes that words like Zagreb, Zagreba and Zagrebu are different and shows different results for each one. Ideally, a search for Zagreba should show results for all 3.
There is an option for snowball filtering in documents.py but it only works for English words.

Solution:

Add exact term search

For example, if I search for:

"Zagreb Park"

it should only show results for that exact term
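
A sketch of how this could be wired up: if the user wraps the query in double quotes, build an Elasticsearch `match_phrase` query instead of a regular `match`. The function name and the `content` field are assumptions. Note that `match_phrase` matches the analyzed words in order, which is close to, but not strictly, a character-for-character exact match.

```python
def build_text_query(user_input, field="content"):
    """Return a match_phrase query for quoted input, otherwise a match query."""
    text = user_input.strip()
    if len(text) >= 2 and text[0] == '"' and text[-1] == '"':
        # Strip the surrounding quotes and require the words in this order.
        return {"match_phrase": {field: text[1:-1]}}
    return {"match": {field: text}}
```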
