This project demonstrates the application of the Expectation-Maximization (EM) algorithm for the unsupervised clustering of news articles into thematic groups. The goal is to categorize large sets of news articles based on content, facilitating an understanding of the underlying themes without prior labeling.
The EM algorithm is a probabilistic method for maximum likelihood estimation in the presence of latent variables. Here the latent variable is each article's unobserved cluster, so fitting the model groups articles into thematically similar clusters and exposes the natural groupings and themes present in the news content.
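To make the idea concrete, here is a minimal sketch of EM for a mixture of multinomials over word counts, written with NumPy. It is an illustration under stated assumptions, not the notebook's actual code: the function name `em_multinomial_mixture`, the additive `smoothing` constant, the random initialization, and the toy count matrix are all hypothetical.

```python
import numpy as np


def em_multinomial_mixture(counts, n_clusters, n_iter=50, smoothing=0.1, seed=0):
    """Cluster documents with EM on a mixture of multinomials (illustrative sketch).

    counts: (n_docs, vocab_size) array of word counts.
    Returns (responsibilities, cluster_priors, word_probs).
    """
    rng = np.random.default_rng(seed)
    n_docs, vocab_size = counts.shape
    # Assumed initialization: random soft assignments P(cluster | document).
    resp = rng.dirichlet(np.ones(n_clusters), size=n_docs)

    for _ in range(n_iter):
        # M-step: re-estimate cluster priors and smoothed word distributions.
        priors = resp.sum(axis=0) / n_docs
        word_counts = resp.T @ counts  # shape (n_clusters, vocab_size)
        word_probs = (word_counts + smoothing) / (
            word_counts.sum(axis=1, keepdims=True) + smoothing * vocab_size
        )

        # E-step in log space: log P(cluster) + sum_w count_w * log P(w | cluster).
        log_joint = np.log(priors + 1e-12) + counts @ np.log(word_probs).T
        # Subtract the row maximum before exponentiating to avoid underflow.
        log_joint -= log_joint.max(axis=1, keepdims=True)
        resp = np.exp(log_joint)
        resp /= resp.sum(axis=1, keepdims=True)

    return resp, priors, word_probs


# Toy data: the first three documents use words 0-1, the last three use words 2-3.
toy_counts = np.array([
    [5, 4, 0, 0],
    [6, 3, 0, 0],
    [4, 5, 1, 0],
    [0, 0, 5, 6],
    [0, 1, 4, 5],
    [0, 0, 6, 4],
])
resp, priors, word_probs = em_multinomial_mixture(toy_counts, n_clusters=2)
labels = resp.argmax(axis=1)  # hard cluster label per document
```

Working in log space and subtracting the per-document maximum keeps the E-step numerically stable for long documents, and the whole update is expressed as matrix products rather than Python loops over documents.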
- Preprocessing: Documents are parsed with regular expressions for format matching, tokenized to encode words, and stored as a sparse CSR matrix.
- Efficiency: The implementation leverages NumPy for vectorization, achieving a runtime of approximately 5 seconds.
- Analysis: Focuses on applying the EM algorithm to the preprocessed data to identify thematic clusters.
- Objective: Achieve a target accuracy score by tuning parameters such as the `lambda` and `k` values.
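The preprocessing step listed above can be sketched as follows. This is an assumption-laden illustration rather than the project's code: the token pattern `TOKEN_RE`, the helper `build_csr`, and the sample headlines are hypothetical. The three CSR arrays are built by hand with NumPy to show what the format stores; they could be passed to `scipy.sparse.csr_matrix((data, indices, indptr), shape=...)` if SciPy is available.

```python
import re
from collections import Counter

import numpy as np

# Hypothetical token pattern: runs of lowercase ASCII letters.
TOKEN_RE = re.compile(r"[a-z]+")


def build_csr(documents):
    """Tokenize documents and return CSR arrays of word counts plus the vocabulary."""
    vocab = {}            # token -> column index
    data = []             # non-zero counts, row after row
    indices = []          # column index of each entry in `data`
    indptr = [0]          # row i occupies data[indptr[i]:indptr[i + 1]]
    for text in documents:
        token_counts = Counter(TOKEN_RE.findall(text.lower()))
        for token, count in token_counts.items():
            # Assign the next free column the first time a token is seen.
            col = vocab.setdefault(token, len(vocab))
            indices.append(col)
            data.append(count)
        indptr.append(len(indices))
    return (np.array(data), np.array(indices), np.array(indptr)), vocab


sample_docs = ["Stocks rise as markets rally", "Markets fall; stocks fall"]
(data, indices, indptr), vocab = build_csr(sample_docs)
```

Only the non-zero counts are stored, which matters for news text because each article uses a small fraction of the full vocabulary.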
- Python 3.x
- Libraries: NumPy, Pandas, Scikit-learn, Matplotlib
To replicate the analysis:
- Install dependencies: `pip install numpy pandas scikit-learn matplotlib`.
- Execute the Jupyter Notebook (`Applied_probability_models_for_CS_Exercise_3.ipynb`) in a Jupyter environment.