Apache Flagon Distill is a python package to support and analyze Flagon UserAle.js logs

Home Page: https://flagon.apache.org/

License: Apache License 2.0


flagon-distill's Introduction

Apache Flagon Distill


This project is a work in progress, prior to an official Apache Software Foundation release. Check back soon for important updates.

Please see our readthedocs.org pages for documentation.

A contribution guide has been provided here.

Installation

To install and set up the Python project, Distill uses Poetry, a dependency and package management tool. Poetry simplifies the management of project dependencies and virtual environments, ensuring consistent and reproducible builds.

Prerequisites

Before you begin, make sure you have the following prerequisites installed on your system:

  • Python (>= 3.8)
  • Poetry (>= 1.0)

You can check your Python version by running:

python --version

This will return the version of Python installed on your system. If you do not have Python installed, you can download it from the official website. However, we recommend using a Python version manager such as pyenv. You can refer to this guide for setting it up: pyenv guide.
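For example, with pyenv installed, you can install and select a compatible Python version before proceeding (the version number below is illustrative):

pyenv install 3.11.9
pyenv local 3.11.9
python --version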

You can install Poetry in a number of ways (see the Poetry docs for all methods). We recommend one of the following two:

Official Installer:

Linux, macOS, Windows (WSL)

curl -sSL https://install.python-poetry.org | python3 -

Windows (Powershell)

(Invoke-WebRequest -Uri https://install.python-poetry.org -UseBasicParsing).Content | py -

pipx:

pipx install poetry

The above two methods should minimize the chance of dependency conflicts with your system (global) Python installation. Some users have reported Poetry using an incorrect Python environment instead of the project's local virtual environment when it was installed with the regular pip method. If you run into issues, please refer to the official Poetry docs or GitHub for more in-depth installation instructions.

Installation Steps

Follow these steps to set up and install the project:

  1. Clone the repository:

    git clone https://github.com/apache/flagon-distill.git
  2. Navigate to the project directory:

    cd flagon-distill
  3. Use Poetry to install project dependencies and create a virtual environment:

    poetry install

    This command reads the pyproject.toml file and installs all required packages into a dedicated virtual environment.

  4. Activate the virtual environment:

    poetry shell

    You are now inside the project's virtual environment, which isolates the project's dependencies from your system-wide Python packages.

  5. Run the tests:

    You can now run the tests to make sure everything installed properly. For example:

    make test

    Remember that you need to activate the virtual environment (step 4) each time you work on the project.

Updating Dependencies

To update project dependencies, you can use the following command:

poetry update

This command updates the locked dependency versions in poetry.lock to the latest versions that are still compatible with the constraints declared in pyproject.toml.

Uninstalling

To uninstall the project and its dependencies, simply deactivate the virtual environment (if activated) by typing:

exit

This will exit the virtual environment. You can then safely delete the project directory.

By following these installation steps, you can easily set up and manage the Python project using Poetry. Enjoy coding!

flagon-distill's People

Contributors

amirmghaemi, broden222, dependabot[bot], eandrewjones, grtnation, hungryarthi, jlhitzeman, jyyjy, krassmann12, lewismc, mdiep-cese, michellebeard, poorejc, rc10house, vl8x


flagon-distill's Issues

chore(testing): add TOX for cross-version python testing

We should be testing our code across all supported Python versions (>=3.8, <4.0) in our CI/CD pipeline. A good way to start is by adding the feature locally using tox. Add and configure tox, then add a Makefile target for testing across all supported versions of Python.
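As a starting point, a minimal tox configuration might look like the sketch below. The environment list, the isolated_build setting (needed for Poetry-managed builds), and the pytest invocation are assumptions to adapt to the project's actual setup.

# tox.ini (sketch)
[tox]
envlist = py38, py39, py310, py311
isolated_build = true

[testenv]
deps =
    pytest
commands =
    pytest

A Makefile target could then simply invoke tox (e.g., a hypothetical test-all target that runs tox).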

feat(LogsRepresentation): Add a class/model for ingested logs

Problem

We need to add a data class or baseline representation for logs (our most granular unit of data). Ideally we want to focus on UserALE logs, but we should also be able to ingest arbitrary logs that may come from other telemetry providers in the future. To do this we can use Pydantic: https://docs.pydantic.dev/latest/

Pydantic allows for data validation against a given schema. To use it, you create a custom class that inherits from a Pydantic model. We can then unpack our JSON data into keyword arguments for the class constructor with the double-asterisk (**) operator.

Below is an example of successful validation:

from datetime import datetime

from pydantic import BaseModel, PositiveInt


class User(BaseModel):
    id: int
    name: str = 'John Doe'
    signup_ts: datetime | None
    tastes: dict[str, PositiveInt]


external_data = {
    'id': 123,
    'signup_ts': '2019-06-01 12:22',  # string is parsed into a datetime
    'tastes': {
        'wine': 9,
        b'cheese': 7,  # bytes key is coerced to str
        'cabbage': '1',  # numeric string is coerced to int
    },
}

user = User(**external_data)  # raises ValidationError if the data does not match the schema

print(user.id)
#> 123
print(user.model_dump())
"""
{
    'id': 123,
    'name': 'John Doe',
    'signup_ts': datetime.datetime(2019, 6, 1, 12, 22),
    'tastes': {'wine': 9, 'cheese': 7, 'cabbage': 1},
}
"""

We also want to make sure we are using Python's type hints, which can be extended via the typing module.

Definition of done:

  1. Can validate against a provided schema, defaulting to a UserALE schema
  2. Should be able to convert non-UserALE-schema logs to the UserALE schema if the user provides a mapping
  3. Can de/serialize from/to JSON

The above will need to be accomplished with class methods

Future goals:

  1. Can de/serialize from/to other supported formats
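As a rough sketch of what the log class and its JSON class methods could look like (the field subset, the from_json/to_json names, and the mapping behavior below are assumptions, not the full UserALE schema):

import json
from typing import Any, Dict, Optional

from pydantic import BaseModel


class UserAleLog(BaseModel):
    # Illustrative subset of UserALE fields; the real schema has many more.
    type: str
    target: Optional[str] = None
    pageUrl: Optional[str] = None
    clientTime: Optional[int] = None  # assumed epoch milliseconds in raw logs

    @classmethod
    def from_json(cls, raw: str, mapping: Optional[Dict[str, str]] = None) -> "UserAleLog":
        """Deserialize a JSON string, optionally renaming fields via a user-provided mapping."""
        data: Dict[str, Any] = json.loads(raw)
        if mapping:
            data = {mapping.get(key, key): value for key, value in data.items()}
        return cls(**data)

    def to_json(self) -> str:
        """Serialize back to JSON."""
        return self.model_dump_json()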

Chore(Pydantic): Pydantic throws linter errors which conflict with pre-commit

When creating some Pydantic types, especially in types.py, we get linter errors after running the pre-commit script or the Makefile target. The Pydantic code is "correct", but the linter errors mean the pre-commit check won't let us commit it. We need to figure out a way to selectively ignore or correct certain "type errors" that show up.
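One way to handle this, assuming the pre-commit hooks run mypy (adjust if the project uses a different type checker), is a per-module override so the rest of the codebase stays strictly checked; the module path below is hypothetical:

# pyproject.toml (sketch) -- relax type checking only for the Pydantic type definitions
[[tool.mypy.overrides]]
module = "distill.types"  # hypothetical module path; point this at the real types.py module
ignore_errors = true

Alternatively, individual lines can be suppressed with targeted # type: ignore[<error-code>] comments so only the specific Pydantic constructs are exempted.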

getUUID refactor

This indeed fixes the problem. However, it's unclear why we need to construct an id from the individual field values in the log.

Questions:

* How is this function used elsewhere in the codebase?

* Do we want/need to be able to parse the id to retrieve those values in some way?

* Do we want our ids to have a partial-logical ordering?

If the only requirement is to generate a uuid, then why not just use str(uuid.uuid4()) and call it a day? We'll never get a collision.

Tagging @Jyyjy or @amirmghaemi

As the package is written, getUUID has to return the same value when the same log is passed in. str(uuid.uuid4()) will create a different UUID each time, even for the same log. hash() might be a better option (see the deterministic sketch after the list below).

  1. getUUID isn't really used within the distill package. Users are expected to use it to create a dictionary mapping UUIDs to logs. That dictionary is then what's passed to the segmentation functions. This is one of the biggest pains of working with Distill: you have to manage the UUIDs and the dictionary of logs yourself.
  2. No, all that info is in the log, which the UUID (assuming the user set things up correctly) maps to.
  3. Not sure exactly what you mean. One of the assumptions of the segmentation functions is that the user sorts the log dictionary by clienttime.
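For reference, a deterministic alternative to hash() (whose output for strings varies across processes due to hash randomization) is uuid.uuid5 over a canonical serialization of the log. A minimal sketch, not the current getUUID implementation:

import json
import uuid
from typing import Any, Dict


def get_stable_uuid(log: Dict[str, Any]) -> str:
    # Canonicalize the log so the same content always produces the same UUID.
    canonical = json.dumps(log, sort_keys=True, separators=(",", ":"), default=str)
    return str(uuid.uuid5(uuid.NAMESPACE_URL, canonical))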

Also, the reason @mdiep-cese ran into this issue is that interval logs have some inconsistencies in UserALE, and we have historically filtered out all interval logs. I'm not sure about the details, but that's been Josh's guidance. This may be the relevant ticket. But my point is that nothing in this package is built to deal with interval logs.

Distill is even less mature than userale. The upside is that we can change things a lot without really affecting anyone. I'm team fresh rewrite.

Edit: found an old PR which sparked a discussion about this last year
UMD-ARLIS#18

Originally posted by @Jyyjy in #29 (comment)

Migrate from rst to markdown

The current release uses a mixture of .rst files (held over from pre-release code) and .md files (introduced in the first release) for documentation. Using a mixture of markup languages for documentation leads to confusion, especially for new contributors.

I recommend migrating everything to Markdown for simplicity's sake.

Feature(Schema): Support UserALE Interval Schema

Problem

The current implementation of the default UserAleSchema assumes logs are of the raw type. There is no explicit support for interval logs.

Solution

We need to make the schema more generic to cover both log types and write tests for both. This also means updating the implementation of the timestamp property, as it currently relies on clientTime, which is not present in interval logs.
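One possible direction, sketched with Pydantic below: accept either clientTime (raw logs) or start/end times (interval logs) and derive the timestamp property from whichever is present. The interval field names and the logType field are assumptions to verify against the actual UserALE interval schema.

from typing import Optional

from pydantic import BaseModel, model_validator


class UserAleLogSketch(BaseModel):
    logType: str                      # assumed: "raw" or "interval"
    clientTime: Optional[int] = None  # present on raw logs
    startTime: Optional[int] = None   # assumed present on interval logs
    endTime: Optional[int] = None     # assumed present on interval logs

    @model_validator(mode="after")
    def check_time_fields(self) -> "UserAleLogSketch":
        # Every log must carry at least one usable timestamp field.
        if self.clientTime is None and self.startTime is None:
            raise ValueError("log must carry either clientTime or startTime")
        return self

    @property
    def timestamp(self) -> int:
        # Raw logs use clientTime; interval logs fall back to their start time.
        return self.clientTime if self.clientTime is not None else self.startTime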

feat(NormalizeTimestamp): Normalize UserAle and arbitrary log timestamps to UTC

Problem

We need to be able to normalize timestamps on logs that may come from different timezones and in different formats into a single UTC time for comparability. This ticket is slightly open-ended: we could strictly define the formats we are going to accept for timestamps, OR we could attempt to parse any given timestamp with something like dateparser. The normalization functions can then be called by the ID creation functionality when parsing JSON logs.
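A minimal stdlib-only sketch of the strict-format option, accepting epoch milliseconds or ISO-8601 strings (swapping the parsing step for dateparser would give the open-ended behaviour instead):

from datetime import datetime, timezone
from typing import Union


def to_utc(value: Union[int, float, str]) -> datetime:
    """Normalize an epoch-milliseconds number or an ISO-8601 string to an aware UTC datetime."""
    if isinstance(value, (int, float)):
        return datetime.fromtimestamp(value / 1000, tz=timezone.utc)
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        # Assumption: naive timestamps are treated as already being in UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)

For example, to_utc(1643510623000) and to_utc("2022-01-30T02:43:43+00:00") both yield the same aware UTC datetime.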

Definition of Done

  1. Determine strict timestamp format or open ended approach
  2. Create functionality to parse a timestamp string to UTC
  3. Make functionality available to log data class

Add function to segment logs into user sessions

What are "User Sessions"?

Most user behavior services provide some definition of a user "session" and then segment the log stream into sessions for further analysis. For example, LogRocket defines a session as:

A session is a series of user interactions on your site, beginning with the first page they visit and ending with either:
a.) a period of inactivity lasting longer than 30 minutes, or
b.) after the user has navigated away from your app for more than 2 minutes. This includes closing the tab or navigating to a different domain on the tab.

"Activity" is defined as any user mouse movement, clicks, or scrolls.

As an example, if your user visits your landing page, then your app, and then refreshes the page all within 30 minutes of each other without closing the tab, the entire experience is recorded in a single session. If the user returns back to your site after another hour, a new session recording starts from the moment that they do the first action.

LogRocket sessions also support recording across multiple tabs, so a user opening a link in your app in a new tab will count as the same session. This means that if your app is running in multiple tabs, each tab would need to be navigated away from in order to end a session after 2 minutes. Otherwise, it wouldn't end until a period of inactivity across all tabs lasting longer than 30 minutes.

Why do we need "User Sessions"?

Sessions are a particularly useful unit by which to analyze user behavior since they represent a logical clustering of activity. Answers to simple questions such as:

  • How long did the user's first session last?
  • How long are a user's sessions, on average?
  • What actions did the user perform in their session?

all provide quite a bit of insight into whether and how users engage with an application. Generally speaking, sessions are a great entry point for building an understanding of your app's UX.

Proposed change

We should add a method that segregates the entire log stream into appropriate session buckets according to some definition of a "user session." It need not necessarily be the LogRocket definition shared above; however, I am proposing that as a reasonable starting point.
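As a concrete starting point, a gap-based segmentation along the lines of the LogRocket definition could look like the sketch below. It assumes logs are sorted chronologically and carry a clientTime field in epoch milliseconds, and it only implements the 30-minute inactivity rule (not the tab-navigation rule).

from typing import Any, Dict, List

INACTIVITY_GAP_MS = 30 * 60 * 1000  # 30 minutes of inactivity ends a session


def segment_sessions(logs: List[Dict[str, Any]],
                     gap_ms: int = INACTIVITY_GAP_MS) -> List[List[Dict[str, Any]]]:
    """Split a chronologically sorted log stream into sessions at inactivity gaps."""
    sessions: List[List[Dict[str, Any]]] = []
    for log in logs:
        if sessions and log["clientTime"] - sessions[-1][-1]["clientTime"] <= gap_ms:
            sessions[-1].append(log)
        else:
            sessions.append([log])
    return sessions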

feat(LogUniqueId): Create unique ID for logs

Tied to #40

Problem

We need our logs to have unique ID values we can use to index them in memory efficiently. The IDs should be tied to the timestamp of the log so we can sort them, and there should be no collisions on ID values. When we start building out segmentation functionality, we want to be able to operate on these logs in an efficient data structure (binary tree, min-heap, etc.), so incorporating the timestamp into the ID is important.

Approach

We should use Prefixed K-Sortable Unique IDentifiers (PKSUID) to accomplish this. A Python module exists here. The "prefix" part allows a prefix to be prepended to the ID, such as log_1032HU2eZvKYlo2CEPtcnUvl. This will allow us to use PKSUIDs for other objects in the future, such as segments (ex: seg_1032HU2eZvKYlo2CEPtcnUvl).

Example

Below is an example of using the module. We want to make sure to set the timestamp ourselves with the timestamp value of the logs. These should be normalized to Coordinated Universal Time (UTC) (ticket here: #42)

from pksuid import PKSUID

# generate a new unique identifier with the prefix usr
uid = PKSUID('usr')

# returns 'usr_24OnhzwMpa4sh0NQmTmICTYuFaD'
print(uid)

# returns: usr
print(uid.get_prefix())

# returns: 1643510623
print(uid.get_timestamp())

# returns: 2022-01-30 02:43:43
print(uid.get_datetime())

# returns: b'\x81>*\xccDJT\xf1\xbe\xa9\xf3&\xe8\xa5\xb2\xc1'
print(uid.get_payload())

# convert from a str representation back to PKSUID
uid_from_string = PKSUID.parse('usr_24OnhzwMpa4sh0NQmTmICTYuFaD')

# this can now be used as usual
# returns: 1643510623
print(uid_from_string.get_timestamp())

# conversion to and parsing from bytes is also possible
uid_as_bytes = uid.bytes()
uid_from_bytes = PKSUID.parse_bytes(uid_as_bytes)

# returns: 2022-01-30 02:43:43
print(uid_from_bytes.get_datetime())

# all the standard comparison operators are available
import time
ts = int(time.time())

# OUR USE CASE
lesser_uid, greater_uid = PKSUID('usr', timestamp = ts), PKSUID('usr', timestamp=ts + 5)

# returns True
print(lesser_uid < greater_uid)

# except for the case of equivalence operators (eq, ne), the prefix is not taken into account when comparing
prefixed_uid_1, prefixed_uid_2 = PKSUID('diff', timestamp = ts), PKSUID('prefix', timestamp=ts + 5)

# returns True
print(prefixed_uid_1 < prefixed_uid_2)

Definition of Done

  1. Build functionality into log class to create ID when parsing JSON
  2. Normalize time to UTC -> #42
  3. Create PKSUID and store as field within log object

Remember to use Python type hints when implementing
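Putting the pieces together, ID creation during JSON parsing might look roughly like this sketch. The log_ prefix and the clientTime-in-milliseconds assumption mirror the examples above; the dict-based log and the parse_log helper are hypothetical stand-ins for the eventual log class.

import json
from typing import Any, Dict

from pksuid import PKSUID


def parse_log(raw: str) -> Dict[str, Any]:
    """Parse a UserALE JSON log and attach a PKSUID keyed to its (UTC) timestamp."""
    log: Dict[str, Any] = json.loads(raw)
    ts_seconds = int(log["clientTime"] / 1000)  # assumes clientTime is epoch milliseconds (UTC)
    log["id"] = str(PKSUID("log", timestamp=ts_seconds))
    return log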

feat(FeatureDefintion): Add support to label logs

Problem

Raw logs from UserALE often require additional manipulation before they are analytics-ready. For instance, fields like path and target provide globally unique identifiers for specific objects in the DOM by using CSS selectors, which is important for mapping an event back to its source, but these selectors are not human-interpretable. If a specific element, such as a button, is extremely important (say, because it represents users interacting with a key feature or finishing an important step in a workflow) and you know its CSS selector, then you'll likely want to add a label to that log to make it more interpretable and to make downstream data wrangling (grouping, filtering, sorting) easier.

Distill does not currently have a nice built-in way to label logs. It's easy enough to do this in raw Python, but we want to provide functions to help with this.

Proposed Solution

Let's create a primitive (a core library component) called FeatureDefinition that allows users to specify a rule, or set of rules, and an associated label, and then use it to add labels to our logs in Distill.

Example of FeatureDefinition class interface and how we might use it:

from typing import Any, Callable, Dict

class FeatureDefinition:
    # Implement class logic: store a `rule` callable and a `label` string
    ...

def map_rule(log: Dict[str, Any]) -> bool:
    return "pageUrl" in log and "map" in log["pageUrl"]

map_page_definition = FeatureDefinition(
    rule=map_rule,
    label="map_page"
)

label_features(logs=logs, definitions=[map_page_definition])

where FeatureDefinition is a class that has two properties:

  • rule: a callable that accepts a UserALE log as input and returns a boolean indicating whether the rule was met
  • label: a string we want to add to the log if the rule is met

Above I am using type hints (Dict[str, Any]) from the Python typing library to loosely represent the structure of an ingested UserALE log.

The label_features function can be simple. For example:

from typing import Any, Dict, List

def label_features(logs: List[Dict[str, Any]], definitions: List[FeatureDefinition]) -> List[Dict[str, Any]]:
    for log in logs:
        for definition in definitions:
            if definition.rule(log):
                # Initialize the labels list the first time any rule matches this log
                if "labels" not in log:
                    log["labels"] = []
                log["labels"].append(definition.label)
    return logs

I recommend starting by working with the existing sample JSON data we have in various places in the codebase. We need to implement the logic for the FeatureDefinition class and choose where to place this code in the codebase.

What are Feature Definitions?

Feature definitions are analytics-friendly wrappers around the most crucial parts of your application. For example, if there is a particular CSS selector or URL pattern that maps to a particular feature or page (a 'Maps' page or a 'tag object' button), you can define it once as a feature definition and leverage that feature throughout Distill.

feat(LogMagicMethods): Implement magic methods for log class to allow comparison operators

Problem

We need to be able to compare logs to each other using comparison operators (>, <, ==, >=, ...). This is possible by implementing magic methods (or dunder methods, e.g. __eq__(); guide here). We also need to decide which metric to compare the logs with; most likely we should use the timestamp for chronological comparison, but this doesn't exactly make sense for the equals operator (==), so we will need to figure out a different metric for that.

Note: timestamp comparison with comparison operators becomes possible by default when using PKSUID and comparing the ID objects. We could go this route by bubbling up the PKSUID comparison magic methods to the Log object itself. This would also allow us to redefine the == comparator to something other than timestamp equality if we want to.

Other notes

  • Raw logs have a single timestamp
  • Interval logs have a start and a stop timestamp
  • We can compare two logs using comparison operators to see if one occurred before or after another
  • Log1 > Log2 iff Log1's min time is strictly greater than Log2's max time
  • Log1 < Log2 iff Log1's max time is strictly less than Log2's min time
  • Log1 == Log2 if IDs are identical (this is one option; in theory they could have different fields if one of them has been mutated)
  • We need to think through whether this is too overloaded, because the behavior and the underlying field used for comparison depend on the operator itself

Definition of Done

  • Should be comparable with each other
  • Should be able to compare logs against a timestamp
    • Log1 < datetime.now()
  • Should be able to check if logs fall within a given interval:
    • datetime.datetime(year=2018, month=3, day=20) < log1 < datetime.datetime(year=2018, month=6, day=21)
  • If you add two or more logs, the resulting object should be a chronologically-ordered segment
  • Should support string magic methods
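A minimal sketch of what the comparison methods could look like, assuming each log exposes min_time and max_time fields (identical for raw logs) and an id field as discussed above; the Log class and its constructor are illustrative, not the package's current implementation:

from datetime import datetime
from typing import Union


class Log:
    def __init__(self, id: str, min_time: datetime, max_time: datetime):
        self.id = id
        self.min_time = min_time
        self.max_time = max_time

    def __eq__(self, other: object) -> bool:
        # One option discussed above: equality is identity of IDs.
        return isinstance(other, Log) and self.id == other.id

    def __lt__(self, other: Union["Log", datetime]) -> bool:
        # Log1 < Log2 iff Log1's max time is strictly before Log2's min time;
        # comparing against a bare datetime also supports interval checks like
        # datetime(...) < log1 < datetime(...).
        other_min = other.min_time if isinstance(other, Log) else other
        return self.max_time < other_min

    def __gt__(self, other: Union["Log", datetime]) -> bool:
        # Log1 > Log2 iff Log1's min time is strictly after Log2's max time.
        other_max = other.max_time if isinstance(other, Log) else other
        return self.min_time > other_max

    def __str__(self) -> str:
        return f"Log(id={self.id}, min_time={self.min_time}, max_time={self.max_time})"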
