Comments (3)

epogrebnyak commented on July 22, 2024

The stream of values is produced here:

    from parse import all_values
    for text in all_values():
        filter_value(text)

cell.filter_value() needs to be refactored, with the emphasis on readability first and on performance only after that.

from parser-rosstat-kep.

epogrebnyak commented on July 22, 2024
  • 1) Changed the regex: the previous version of kill_comment could raise an exception (I described the case above).
    The previous version had this fragment in the regex: [\d.,]* , i.e. an EMPTY run of digits, among other things, passed the regex validation, which makes no sense.
    The proposed version captures the part that actually is a float, namely this fragment: (\d+[.,]?\d+)
  • [x] 2) If the regex fails to match, kill_comment returns the original text.
  • 3) as_float now returns False when it cannot produce a float.
  • 4) In filter_value I simplified some checks and removed a redundant one (as noted above, 'if text == ""' is unnecessary).
    In the new version no exceptions propagate to the caller (on a parsing failure False is always returned).
    Added asserts with test cases.
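
To make point 1 concrete, here is a minimal sketch that compares only the two regex fragments quoted above, not the full patterns (the old full pattern is not shown in this thread). The names OLD_PART, NEW_PART and to_float are hypothetical and exist only for this illustration.

```python
import re

# Fragment quoted from the previous regex: zero or more digits, dots, commas.
OLD_PART = re.compile(r"[\d.,]*")
# Fragment quoted from the proposed regex: digits with an optional separator.
NEW_PART = re.compile(r"(\d+[.,]?\d+)")


def to_float(text):
    """Hypothetical helper: convert a matched fragment to float."""
    return float(text.replace(",", "."))


# The old fragment accepts an empty string and a separator-only string ...
assert OLD_PART.fullmatch("") is not None
assert OLD_PART.fullmatch(",.") is not None

# ... and both of those make float() raise ValueError afterwards.
for bad in ("", ",."):
    try:
        to_float(bad)
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for %r" % bad)

# The new fragment rejects both and still captures a real number.
assert NEW_PART.fullmatch("") is None
assert NEW_PART.fullmatch(",.") is None
assert to_float(NEW_PART.fullmatch("6762,3").group(1)) == 6762.3
```

One thing to keep in mind: taken on its own, the fragment (\d+[.,]?\d+) needs at least two digits, so a lone "5" does not match it.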


epogrebnyak commented on July 22, 2024
import re

# Raw string, so that \D, \d and \s are not treated as string escapes.
COMMENT_CATCHER = re.compile(r"^\D*((?:0|[1-9]\d*)(?:[.,]\d+)?)\s*(?=\d\))")
def kill_comment(text, rx=COMMENT_CATCHER):
    result = rx.match(text)
    if result is None:
        return text
    return result.group(1)

assert kill_comment('0,12)abc)') == '0,1'
assert kill_comment('6762,31)2)') == '6762,3'
assert kill_comment("5.61test)2") == "5.61test)2"  # not matched: the source text is returned as is
assert kill_comment(")") == ")"  # not matched: the source text is returned as is
assert kill_comment("abc5.6 2)") == "5.6"
assert kill_comment("abc5.6123 2)") == "5.6123"
assert kill_comment("abc5.6123 2)3") == "5.6123"
assert kill_comment("abc5.6123 2)34") == "5.6123"

def as_float(text):
    try:
        return float(text)
    except ValueError:
        return False

def filter_value(text):
    """Convert *text* to a float, assuming it may contain a 'comment)'
    or other unexpected content.

    Return the parsed float on success, False otherwise.
    """
    if not text or text == "…" or text == "-":
        return False
    if text[0] != " " and " " in text:
        return filter_value(text.split(" ")[0])
    if ")" in text:
        text = kill_comment(text)
    if text[-1] == ",":  # strip a trailing comma, e.g. "5.67,"
        text = text[:-1]
    return as_float(text.replace(",", "."))

assert filter_value(None) == False
assert filter_value("") == False
assert filter_value(" ") == False
assert filter_value("…") == False
assert filter_value("-") == False
assert filter_value("a") == False
assert filter_value("ab") == False
assert filter_value(")") == False
assert filter_value("5.61test)2") == False
assert filter_value("5.6test)") == False
assert filter_value('5.678,,') == False

assert filter_value("5.6") == 5.6
assert filter_value("5,6") == 5.6
assert filter_value('57,0') == 57.0
assert filter_value("5.67") == 5.67
assert filter_value("5.67,") == 5.67
assert filter_value('5,678') == 5.678
assert filter_value('6762,31) abc ') == 6762.3
assert filter_value('6762,31)2)') == 6762.3

if __name__ == "__main__":
    import time

    from parse import all_values

    # time.clock() was removed in Python 3.8; perf_counter() is the replacement.
    start = time.perf_counter()
    for text in all_values():
        z = filter_value(text)
    end = time.perf_counter()
    print(end - start)
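
A side note on the "return False on failure" contract from point 3, as a minimal sketch rather than a proposed change: in Python 0.0 == False is true, so a caller that compares the result with == cannot tell a parsed zero from a failure. The as_float below repeats the one above so that the snippet runs on its own; the variables zero and failed are only for illustration.

```python
def as_float(text):
    """Same contract as as_float above: float on success, False on failure."""
    try:
        return float(text)
    except ValueError:
        return False


zero = as_float("0")      # a valid value that happens to be 0.0
failed = as_float("abc")  # a parsing failure

# Equality cannot tell the two apart, because 0.0 == False in Python.
assert zero == False
assert failed == False

# An identity check can.
assert zero is not False
assert failed is False
```

So if zeros can occur in the data, checking the result with `is False` looks like the safer option for callers.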

