Comments (3)
The stream of values comes from here:
from parse import all_values
for text in all_values():
    filter_value(text)
cell.filter_value() needs to be refactored with a focus on readability first, and on performance second.
from parser-rosstat-kep.
- 1) Changed the regex: the previous version of kill_comment could fail with an exception (I described the case above).
The previous version had this part in the regex: [\d.,]* , which means that, among other things, an EMPTY run of digits passed the regex validation, which makes no sense.
The proposed version captures only the part of the text that actually is a float, namely: (\d+[.,]?\d+)
- [x] 2) If the regex fails to match, kill_comment returns the original text.
- 3) as_float now returns False when it cannot produce a float.
- 4) Simplified some checks in filter_value and removed a redundant one (as noted above, 'if text == ""' is unnecessary).
In the new version no exceptions propagate to the caller (on a parsing problem it is guaranteed to return False).
Added asserts covering the cases.
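To illustrate point 1, here is a small sketch that compares only the two sub-patterns quoted above. The names OLD_PART, NEW_PART and full are made up for this example, and the complete previous regex is not shown in this thread, so this only demonstrates how the quoted fragments behave on their own.

```python
import re

# Fragment quoted from the previous regex: zero or more digits or separators.
OLD_PART = re.compile(r"[\d.,]*")
# Fragment quoted as the proposed replacement.
NEW_PART = re.compile(r"\d+[.,]?\d+")


def full(rx, s):
    """Return True if the whole string s matches the pattern rx."""
    return rx.fullmatch(s) is not None


print(full(OLD_PART, ""))        # True: an empty "number" is accepted
print(full(OLD_PART, ".,"))      # True: separators without any digit are accepted
print(full(NEW_PART, ""))        # False
print(full(NEW_PART, "6762,3"))  # True
print(full(NEW_PART, "5"))       # False: this fragment needs at least two digits
```

The last line is worth noting: as quoted, the replacement fragment does not accept a single-digit value, which may or may not matter for the data.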
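import re

# Captures the number that precedes a footnote mark such as '1)' or ' 2)'.
# Raw string, so that \D, \d and \s are not treated as string escapes.
COMMENT_CATCHER = re.compile(r"^\D*((?:0|[1-9]\d*)(?:[.,]\d+)?)\s*(?=\d\))")
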
def kill_comment(text, rx=COMMENT_CATCHER):
    """Strip a trailing footnote mark; return *text* unchanged if nothing matches."""
    result = rx.match(text)
    if result is None:
        return text
    return result.group(1)

assert kill_comment('0,12)abc)') == '0,1'
assert kill_comment('6762,31)2)') == '6762,3'
assert kill_comment("5.61test)2") == "5.61test)2"  # not matched, should return source text as is
assert kill_comment(")") == ")"  # not matched, should return source text as is
assert kill_comment("abc5.6 2)") == "5.6"
assert kill_comment("abc5.6123 2)") == "5.6123"
assert kill_comment("abc5.6123 2)3") == "5.6123"
assert kill_comment("abc5.6123 2)34") == "5.6123"
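For readability, the same pattern could also be written with re.VERBOSE so that each part carries its own comment. This is only a sketch: the names COMMENT_CATCHER_VERBOSE and kill_comment_verbose are hypothetical, and the pattern is meant to be equivalent to COMMENT_CATCHER above, not a replacement chosen by the project.

```python
import re

COMMENT_CATCHER_VERBOSE = re.compile(r"""
    ^\D*                    # skip any leading non-digit characters
    (                       # group 1: the number itself
        (?:0|[1-9]\d*)      #   integer part: a lone 0 or digits not starting with 0
        (?:[.,]\d+)?        #   optional fractional part after a dot or a comma
    )
    \s*                     # optional whitespace before the footnote mark
    (?=\d\))                # lookahead: one digit followed by a closing bracket
""", re.VERBOSE)


def kill_comment_verbose(text, rx=COMMENT_CATCHER_VERBOSE):
    """Return the number before a footnote mark, or text unchanged if no match."""
    m = rx.match(text)
    return text if m is None else m.group(1)


print(kill_comment_verbose("6762,31)2)"))  # 6762,3
print(kill_comment_verbose("abc5.6 2)"))   # 5.6
print(kill_comment_verbose(")"))           # )
```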
def as_float(text):
    """Return float(text), or False if *text* is not a valid number."""
    try:
        return float(text)
    except ValueError:
        return False

def filter_value(text):
    """Converts *text* to float number assuming it may contain 'comment)'
    or other unexpected contents.

    Returns parsed float value on success, False otherwise.
    """
    # Empty input and placeholder cells carry no value.
    if not text or text == "…" or text == "-":
        return False
    # Keep only the first space-separated token, e.g. '6762,31) abc '.
    if text[0] != " " and " " in text:
        return filter_value(text.split(" ")[0])
    if ")" in text:
        text = kill_comment(text)
    if text[-1] == ",":  # drop a trailing comma, e.g. '5.67,'
        text = text[:-1]
    return as_float(text.replace(",", "."))

assert filter_value(None) == False
assert filter_value("") == False
assert filter_value(" ") == False
assert filter_value("…") == False
assert filter_value("-") == False
assert filter_value("a") == False
assert filter_value("ab") == False
assert filter_value(")") == False
assert filter_value("5.61test)2") == False
assert filter_value("5.6test)") == False
assert filter_value('5.678,,') == False
assert filter_value("5.6") == 5.6
assert filter_value("5,6") == 5.6
assert filter_value('57,0') == 57.0
assert filter_value("5.67") == 5.67
assert filter_value("5.67,") == 5.67
assert filter_value('5,678') == 5.678
assert filter_value('6762,31) abc ') == 6762.3
assert filter_value('6762,31)2)') == 6762.3
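One thing a caller may want to keep in mind with the "float or False" contract: in Python a parsed zero compares equal to False, so an equality check cannot tell a failed parse from a legitimate 0.0. A minimal sketch, where as_float mirrors the function above and is_missing is a hypothetical helper that is not part of the project:

```python
def as_float(text):
    """Mirror of as_float above: float on success, False on failure."""
    try:
        return float(text)
    except ValueError:
        return False


def is_missing(value):
    """Hypothetical helper: True only for the False sentinel, never for 0.0."""
    return value is False


zero = as_float("0")
bad = as_float("abc")
print(zero == False, is_missing(zero))  # True False
print(bad == False, is_missing(bad))    # True True
```

An identity check (`is False`) keeps the current return values usable without changing the functions themselves.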
if __name__ == "__main__":
    import time
    from parse import all_values

    # time.clock() was removed in Python 3.8; perf_counter() is the replacement.
    start = time.perf_counter()
    for text in all_values():
        z = filter_value(text)
    end = time.perf_counter()
    print(end - start)
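Since performance is the second goal, a repeatable measurement may be more useful than a single run. Below is a rough sketch with timeit; time_calls, sample and stand_in are names invented for this example, the sample values are made up, and stand_in only takes the place of filter_value so that the snippet runs without the parse module.

```python
import timeit


def time_calls(func, values, repeat=5, number=100):
    """Return the best time in seconds for one pass of func over all values."""
    def one_pass():
        for v in values:
            func(v)
    # Best of several repeats is less sensitive to background noise.
    return min(timeit.repeat(one_pass, repeat=repeat, number=number)) / number


# Made-up sample; the real data would come from parse.all_values().
sample = ["5,6", "6762,31)2)", "…", "-", "abc"]


def stand_in(text):
    """Placeholder for filter_value, used only to keep this sketch self-contained."""
    return text.replace(",", ".")


print(time_calls(stand_in, sample))
```

Passing filter_value and list(all_values()) instead of the placeholders would give a number that can be compared between versions of the function.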