miladbaf / cs412_term_project

This repository contains the code and scripts developed as the term project for the Machine Learning course during the Fall 2023 semester at Sabanci University.

License: MIT License

Jupyter Notebook 100.00%

CS412 (Machine Learning) Term Project

Student working on ML Homework with some AI help. Credit: DALL-E

Overview of the Repository

This repository contains the scripts and code used to analyze ChatGPT interactions and predict grades from them. The primary dataset consists of HTML files of ChatGPT prompts and answers, plus a Jupyter notebook (assignment.ipynb) containing the assignment questions. The project's main Jupyter notebook is available in this repository and on Google Colab.

Key Components

  1. Data Extraction and Imputation: Extracted the text from the HTML files of ChatGPT prompts into a JSON file. For IDs with missing text, imputed the text from files whose size equals the mode (most common) file size.
    • Question-Answer Pair Extraction: Parsed the chat texts to extract question-answer pairs. Questions start with "Anonymous" and responses with "ChatGPTChatGPT".
    • Reading Assignment Questions: Extracted the questions from assignment.ipynb, specifically from the "source" field of its markdown cells.
  2. Data Visualization: Visualized the score data to understand its distribution and identify null values.
  3. Similarity Calculation: Computed similarities between assignment questions and user prompts, and added them to the JSON file for each ID and each question.
  4. Histograms of Similarities: Plotted histograms of the similarities for each question and each ID, and calculated the average similarity as a predictive feature.
  5. Linear Regression Models: Trained multiple linear regression models to predict grades from features such as average similarity, prompt length, number of prompts, average sentiment, response length, frequency of the word "error" in prompts, and frequency of the word "error" in back-to-back prompts.
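
The pair extraction and some of the prompt-based features listed above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the function names, the feature names, and the sample text are hypothetical, and the definition of a back-to-back "error" (two consecutive prompts that both mention the word) is an assumption. Only the two markers come from this README.

```python
import re

USER_MARKER = "Anonymous"        # marker named in this README
BOT_MARKER = "ChatGPTChatGPT"    # marker named in this README


def extract_qa_pairs(chat_text):
    """Split flattened chat text into (prompt, response) pairs."""
    pattern = f"({re.escape(USER_MARKER)}|{re.escape(BOT_MARKER)})"
    # With a capturing group, re.split keeps the markers:
    # [preamble, marker, body, marker, body, ...]
    parts = re.split(pattern, chat_text)
    pairs, prompt = [], None
    for marker, body in zip(parts[1::2], parts[2::2]):
        if marker == USER_MARKER:
            prompt = body.strip()
        elif prompt is not None:
            pairs.append((prompt, body.strip()))
            prompt = None
    return pairs


def prompt_features(pairs):
    """Derive simple per-conversation features (hypothetical feature names)."""
    prompts = [p for p, _ in pairs]
    has_error = ["error" in p.lower() for p in prompts]
    return {
        "num_prompts": len(prompts),
        "total_prompt_words": sum(len(p.split()) for p in prompts),
        "error_prompts": sum(has_error),
        # Assumed definition: consecutive prompts that both mention "error".
        "back_to_back_errors": sum(a and b for a, b in zip(has_error, has_error[1:])),
    }


# Made-up conversation, for illustration only.
sample = (
    "AnonymousHow do I load a CSV?"
    "ChatGPTChatGPTUse pandas.read_csv."
    "AnonymousI get an error."
    "ChatGPTChatGPTCheck the file path."
    "AnonymousStill the same error!"
    "ChatGPTChatGPTPrint the full traceback."
)
pairs = extract_qa_pairs(sample)
features = prompt_features(pairs)
print(features)
```

A marker-based split like this would mis-split any prompt that itself contains the word "Anonymous", so real data may need a stricter pattern.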

Methodology

This project predicts scores from ChatGPT interactions in three stages: data processing, feature extraction, and predictive modeling. Each stage is described below.

Data Processing and Preparation

  • Data Extraction: The project begins by extracting the ChatGPT interactions from the HTML files. Malformed files yield no text, so their missing text is imputed with the text of files whose size equals the mode file size, giving a complete dataset for analysis.
  • Data Visualization: The scores.csv data is first visualized to assess the distribution of scores and to identify missing values, which are then imputed with the mean of the column.
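
Both imputation steps can be sketched with the standard library alone. This is an illustrative sketch under assumptions: the function names, the dictionary layout (ID to text, ID to file size), and the sample values are hypothetical, and the notebook may well use pandas instead.

```python
from collections import Counter
from statistics import mean


def impute_text_from_mode_size(texts, sizes):
    """Fill empty texts with the text of a file whose size is the most common one."""
    valid = [fid for fid, text in texts.items() if text]
    mode_size, _ = Counter(sizes[fid] for fid in valid).most_common(1)[0]
    # Use the first valid file with the mode size as the donor.
    donor = next(texts[fid] for fid in valid if sizes[fid] == mode_size)
    return {fid: text if text else donor for fid, text in texts.items()}


def impute_with_mean(scores):
    """Replace missing scores (None) with the mean of the observed scores."""
    fill = mean(s for s in scores if s is not None)
    return [fill if s is None else s for s in scores]


# Made-up values, for illustration only.
texts = {"a": "hello", "b": "", "c": "world", "d": "hi"}
sizes = {"a": 100, "b": 5, "c": 100, "d": 40}
filled_texts = impute_text_from_mode_size(texts, sizes)

filled_scores = impute_with_mean([90.0, None, 70.0, 80.0])
```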

Feature Engineering

  • Extraction of Question-Answer Pairs: Question and answer pairs are extracted from the ChatGPT interactions. Questions are identified by the "Anonymous" marker and responses by the "ChatGPTChatGPT" marker.
  • Assignment Question Analysis: The questions are extracted from assignment.ipynb, specifically from the "source" field of its markdown cells.
  • Similarity Calculation: Similarities between the text of the assignment questions and the user prompts are calculated and added to the dataset for each user ID.
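
This README does not state which similarity measure the notebook uses. As one possible sketch, the block below uses cosine similarity over bag-of-words counts as a stand-in, and averages, over the questions, the similarity of each question's best-matching prompt. The measure, the aggregation, the function names, and the sample texts are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words count vectors."""
    a, b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


def average_similarity(questions, prompts):
    """Mean over questions of the best-matching prompt's similarity (assumed aggregation)."""
    if not questions or not prompts:
        return 0.0
    best = [max(cosine_similarity(q, p) for p in prompts) for q in questions]
    return sum(best) / len(best)


# Made-up texts, for illustration only.
questions = ["Load the dataset and print its shape", "Train a decision tree classifier"]
prompts = ["how do I load the dataset and print its shape", "what is the weather today"]
score = average_similarity(questions, prompts)
print(round(score, 3))
```

A TF-IDF weighting or sentence embeddings would be natural alternatives; the bag-of-words version is used here only because it needs no external libraries.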

Predictive Modeling

  • Development of Linear Regression Models: Linear regression models are trained on features such as average similarity, prompt length, number of prompts, average sentiment, the length of GPT responses, frequency of the word "error" in prompts, and frequency of the word "error" in back-to-back prompts.
  • Performance Evaluation: Each model's effectiveness is assessed using Mean Squared Error (MSE) and R-squared values, allowing for a comparative analysis of different predictive features.
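
The train-and-evaluate loop for a single feature can be sketched without external libraries. This is a minimal sketch, not the notebook's code: the function names and the numbers are made up, and in practice scikit-learn's LinearRegression, mean_squared_error, and r2_score provide the same functionality.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def mean_squared_error(y_true, y_pred):
    """Average squared difference between targets and predictions."""
    return mean([(t - p) ** 2 for t, p in zip(y_true, y_pred)])


def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot; negative when the model is worse than predicting the mean."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up numbers, for illustration only (not the project's data).
train_x, train_y = [2, 4, 6, 8], [60.0, 70.0, 80.0, 90.0]
test_x, test_y = [3, 7], [66.0, 84.0]

slope, intercept = fit_line(train_x, train_y)
predictions = [slope * x + intercept for x in test_x]
mse = mean_squared_error(test_y, predictions)
r2 = r_squared(test_y, predictions)
print(slope, intercept, mse, r2)
```

Repeating this once per feature and collecting the two metrics yields a comparison table of the kind shown in Results.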

Insights and Conclusions

  • Comparative Analysis of Features: The project compares how well each feature predicts scores. By the MSE and R-squared values reported in Results, average response length and average sentiment give the lowest errors, while the number of prompts and the total number of words in prompts give the highest.
  • Iterative Approach: The project's methodology is iterative, constantly refining the features and models based on the insights gained from data analysis and model evaluations.

Overall, the project methodology is characterized by its data-driven approach, leveraging natural language processing techniques and statistical modeling to derive meaningful insights and predictions from ChatGPT interactions.

Results

The table below summarizes the performance of the models, and the figures that follow plot each feature against the grades.

Performance of the Models:

| Feature | Mean Squared Error | R-squared Score |
|---|---|---|
| Average Similarities | 41.99 | -0.40 |
| Total Number of Words in Prompts | 56.67 | -0.89 |
| Number of Prompts | 55.60 | -0.85 |
| Average Sentiment | 40.74 | -0.36 |
| Average Prompt Length | 41.57 | -0.39 |
| Average Response Length | 38.43 | -0.28 |
| Frequency of "error" in Prompts | 43.16 | -0.44 |
| Back-to-Back "error" Counts in Prompts | 45.08 | -0.50 |

Accuracy varies across features, but every model has a negative R-squared score, meaning it predicts worse than always predicting the mean score. The limited size of the dataset and the narrow spread of the scores (50% of the scores exceed 72) leave the models with relatively large prediction errors.
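
The link between the two reported metrics can be made explicit. The relation below is the standard definition of R-squared. The worked numbers assume that the MSE and the R-squared score were computed on the same evaluation set, which this README does not state; under that assumption, each table row implies the variance of the evaluated grades.

```latex
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
    = 1 - \frac{\mathrm{MSE}}{\mathrm{Var}(y)}
\quad\Longrightarrow\quad
R^2 < 0 \iff \mathrm{MSE} > \mathrm{Var}(y)

\mathrm{Var}(y) = \frac{\mathrm{MSE}}{1 - R^2}:
\qquad \frac{38.43}{1 - (-0.28)} \approx 30.0,
\qquad \frac{56.67}{1 - (-0.89)} \approx 30.0
```

Both rows give a variance of about 30, that is, a standard deviation of roughly 5.5 grade points, which is consistent with the narrow spread of scores noted above.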

Figures (Features vs Grades):

Fig 1. Average Similarities of Prompts/Assignment-Questions vs Grades

Fig 2. Number of Words in Prompts vs Grades

Fig 3. Number of Prompts vs Grades

Fig 4. Average Sentiment of Prompts vs Grades

Fig 5. Average Prompt Length vs Grades

Fig 6. Average GPT Response Length vs Grades

Fig 7. Frequency of "error" in Prompts vs Grades

Fig 8. Back-to-Back "error" Counts in Prompts vs Grades


Team Contributions

Milad Bafarassat: As the sole contributor to this project, I was responsible for all aspects, including data extraction and preprocessing, feature engineering, model development, analysis, and documentation. My role encompassed the entire pipeline from initial data handling to final model evaluation and reporting of findings.
