A comprehensive machine learning project that performs sentiment analysis on IMDb movie reviews using pre-trained BERT models. This beginner-friendly project demonstrates data science and machine learning applications in the entertainment industry.
- Binary Sentiment Classification: Classifies reviews as positive or negative with confidence scores
- Detailed Insights Extraction: Discovers what viewers liked/disliked about movies
  - TF-IDF keyword extraction
  - Aspect-based sentiment analysis (acting, plot, cinematography, music, etc.)
  - Named entity recognition for actors and directors
- Advanced Text Preprocessing: Uses spaCy for stopword removal while preserving sentiment-bearing words
- Rich Visualizations: 10+ charts including word clouds, keyword comparisons, and aspect analysis
- Data Export: Save predictions to CSV and insights to JSON
This project uses the IMDb dataset from HuggingFace, containing:
- 50,000 movie reviews (25,000 train, 25,000 test)
- Binary sentiment labels (positive/negative)
- Pre-split and ready to use
Uses `distilbert-base-uncased-finetuned-sst-2-english` - a pre-trained DistilBERT model fine-tuned for sentiment analysis:
- Expected Accuracy: 85-90%
- No GPU Required: Works on CPU (though GPU speeds it up)
- No Training Needed: Ready to use out of the box
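As a quick sketch of how the model is used (assuming the `transformers` package is installed; this mirrors the standard HuggingFace pipeline API rather than the notebook's exact cells):

```python
# Load the pre-trained sentiment model via the HuggingFace pipeline API.
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

result = classifier("This movie was an absolute masterpiece!")[0]
print(result["label"], round(result["score"], 3))  # e.g. POSITIVE 0.999
```

The first call downloads the model weights (~260 MB) to the local HuggingFace cache; subsequent runs load from disk.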
- Python 3.8 or higher
- pip package manager
- Clone this repository

```bash
git clone https://github.com/mtgrunt/IMDb-Sentiment-Analysis.git
cd IMDb-Sentiment-Analysis
```

- Create a virtual environment (recommended)

```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

- Install dependencies

```bash
pip install -r requirements.txt
```

- Download spaCy language model

```bash
python -m spacy download en_core_web_sm
```

- Launch Jupyter Notebook

```bash
jupyter notebook
```

- Open `imdb_sentiment_analysis.ipynb` and run all cells
- Open the `imdb_sentiment_analysis.ipynb` notebook
- Run all cells sequentially (Cell → Run All)
- The notebook will:
  - Download the IMDb dataset automatically
  - Load the pre-trained model
  - Perform sentiment analysis
  - Generate visualizations
  - Export results to the `outputs/` directory
You can adjust settings in Cell 3 of the notebook:
```python
CONFIG = {
    'model_name': 'distilbert-base-uncased-finetuned-sst-2-english',
    'max_samples': 1000,  # Set to None for full dataset (50k reviews)
    'batch_size': 16,     # Adjust based on your RAM
    'max_length': 512     # Maximum token length for BERT
}
```

Tip: Start with `max_samples=1000` for quick testing, then set to `None` for full analysis.
After running the notebook, you'll find:
- `predictions.csv`: All predictions with metadata
  - Columns: review_text, true_label, predicted_label, confidence_score, true_sentiment, predicted_sentiment, correct_prediction
- `insights.json`: Comprehensive insights summary
  - Dataset statistics
  - Model performance metrics
  - Top positive/negative keywords
  - Aspect-based analysis results
  - Review length statistics
- Visualizations:
  - `review_length_analysis.png`: Review length distributions
  - `confusion_matrix.png`: Model performance visualization
  - `confidence_analysis.png`: Confidence score distributions
  - `sentiment_distribution.png`: Overall sentiment breakdown
  - `keyword_comparison.png`: Top keywords in positive vs negative reviews
  - `wordcloud_positive.png`: Word cloud of positive reviews
  - `wordcloud_negative.png`: Word cloud of negative reviews
  - `aspect_analysis.png`: Sentiment by movie aspects
  - `person_sentiment.png`: Sentiment for mentioned actors/directors
  - `length_analysis.png`: Review length vs sentiment correlation
```
imdb-sentiment-analysis/
├── imdb_sentiment_analysis.ipynb   # Main implementation notebook
├── requirements.txt                # Python dependencies
├── README.md                       # This file
├── .gitignore                      # Git ignore rules
├── outputs/                        # Generated results (created after first run)
│   ├── predictions.csv
│   ├── insights.json
│   └── visualizations/
│       └── [10+ PNG files]
└── data/                           # Dataset cache (auto-downloaded)
```
The notebook extracts several types of insights:
- Top 30 keywords from positive reviews (e.g., "great", "excellent", "amazing")
- Top 30 keywords from negative reviews (e.g., "bad", "terrible", "boring")
- TF-IDF scoring to identify most distinctive words
Analyzes sentiment for 8 movie aspects:
- Acting/Performance
- Plot/Story
- Cinematography
- Direction
- Music/Soundtrack
- Pacing
- Dialogue
- Special Effects
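One common way to do this (a simplified sketch, not necessarily the notebook's exact implementation) is to map each aspect to a small keyword set and attribute a review's predicted sentiment to every aspect it mentions:

```python
# Attribute a review's predicted sentiment to each aspect it mentions.
# Keyword sets here are illustrative; the notebook covers 8 aspects.
ASPECT_KEYWORDS = {
    "acting": {"acting", "performance", "actor", "actress"},
    "plot": {"plot", "story", "storyline"},
    "music": {"music", "soundtrack", "score"},
}

def aspect_sentiments(review: str, sentiment: str) -> dict:
    """Return {aspect: sentiment} for each aspect the review mentions."""
    words = set(review.lower().split())
    return {aspect: sentiment
            for aspect, keys in ASPECT_KEYWORDS.items()
            if words & keys}

print(aspect_sentiments("Great acting but the plot dragged", "positive"))
# → {'acting': 'positive', 'plot': 'positive'}
```

Aggregating these per-aspect tallies over the whole dataset yields the sentiment-by-aspect breakdown shown in `aspect_analysis.png`.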
- Identifies actors, directors, and other persons mentioned
- Analyzes sentiment context of mentions
- Shows which actors are associated with positive vs negative reviews
- Review length correlations with sentiment
- Confidence score distributions
- Prediction accuracy by confidence level
- 0 / Negative: The review expresses negative sentiment
- 1 / Positive: The review expresses positive sentiment
- Range: 0.5 to 1.0
- 0.5-0.7: Low confidence (uncertain prediction)
- 0.7-0.9: Medium confidence (reasonably certain)
- 0.9-1.0: High confidence (very certain)
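The bucketing above can be expressed as a small helper (the function name is illustrative, not from the notebook):

```python
def confidence_bucket(score: float) -> str:
    """Map a binary-classifier confidence score (0.5-1.0) to a readable bucket."""
    if not 0.5 <= score <= 1.0:
        raise ValueError("binary-classifier confidence lies in [0.5, 1.0]")
    if score < 0.7:
        return "low"      # uncertain prediction
    if score < 0.9:
        return "medium"   # reasonably certain
    return "high"         # very certain

print(confidence_bucket(0.65), confidence_bucket(0.85), confidence_bucket(0.97))
# → low medium high
```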
- Accuracy: 85-90% on test set
- Processing Time:
- 1,000 samples: ~2-3 minutes (CPU)
- 25,000 samples: ~30-40 minutes (CPU)
- With GPU: 5-10x faster
Out of memory:

- Reduce `max_samples` in CONFIG (try 500 or 1000)
- Lower `batch_size` to 8 or 4
- Close other applications

spaCy model not found:

```bash
python -m spacy download en_core_web_sm
```

Slow processing:

- Reduce `max_samples` for testing
- Use `max_length=256` instead of 512
- Consider using GPU if available

Import errors:

```bash
pip install -r requirements.txt --upgrade
```

GPU/CUDA errors:

```python
# In Cell 3, force CPU usage:
CONFIG = {
    ...
    'device': 'cpu',  # Force CPU
}
```

The notebook includes an optional section (Cells 30-34) demonstrating how to fine-tune DistilBERT on the IMDb dataset:
- Expected accuracy after fine-tuning: 92-95%
- Requires GPU for reasonable training time
- Training time: ~30-60 minutes on GPU, several hours on CPU
- Good for learning about model training
Note: Fine-tuning is optional and not required for good results.
- Basic Preprocessing (for BERT input):
  - HTML tag removal
  - Whitespace normalization
  - Special character handling
- Advanced Preprocessing (for insights):
  - Custom stopword removal (preserves "not", "no", "very", etc.)
  - Lemmatization with spaCy
  - Token filtering
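A simplified sketch of both steps (regex-based, with a tiny illustrative stopword set; the notebook's advanced pass additionally lemmatizes and filters tokens with spaCy):

```python
import re

# Illustrative stopword set, minus sentiment-bearing words we must keep.
STOPWORDS = {"the", "a", "an", "and", "was", "is", "it"}
KEEP = {"not", "no", "very"}

def basic_clean(text: str) -> str:
    """Strip HTML tags and normalize whitespace for BERT input."""
    text = re.sub(r"<[^>]+>", " ", text)      # HTML tag removal
    return re.sub(r"\s+", " ", text).strip()  # whitespace normalization

def remove_stopwords(text: str) -> str:
    """Drop stopwords while preserving negations and intensifiers."""
    return " ".join(w for w in text.lower().split()
                    if w not in STOPWORDS - KEEP)

cleaned = basic_clean("The plot<br />was not   very good")
print(remove_stopwords(cleaned))  # → plot not very good
```

Keeping "not", "no", and "very" matters because dropping them flips or dilutes sentiment ("not very good" would otherwise collapse to "good").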
- Base: DistilBERT (66M parameters)
- Distilled from BERT-base (110M parameters)
- 40% smaller, 60% faster, 97% of BERT's performance
- Pre-trained on sentiment classification task
This project demonstrates:
- Using pre-trained transformer models (BERT family)
- Text preprocessing for NLP tasks
- Sentiment analysis techniques
- TF-IDF keyword extraction
- Aspect-based sentiment analysis
- Named entity recognition
- Data visualization for ML results
- Proper ML project structure
- transformers: HuggingFace library for BERT models
- datasets: HuggingFace datasets library
- torch: PyTorch for deep learning
- spacy: Industrial-strength NLP
- scikit-learn: Machine learning utilities
- matplotlib/seaborn: Visualization
- wordcloud: Word cloud generation
- pandas/numpy: Data manipulation
Potential improvements:
- Multi-class sentiment (5-star ratings)
- Topic modeling with LDA or BERTopic
- Temporal sentiment analysis
- Comparative analysis (multiple models)
- Interactive web dashboard with Streamlit
- Real-time review analysis API
This project is for educational purposes. The IMDb dataset is subject to HuggingFace's terms of use.
- IMDb dataset from HuggingFace
- DistilBERT model from HuggingFace Transformers
- spaCy NLP library
If you encounter issues:
- Check the Troubleshooting section above
- Ensure all dependencies are installed correctly
- Try with a smaller dataset first (`max_samples=1000`)
- Verify Python version (3.8+)
If you use this project for research or educational purposes, please cite:
- IMDb Dataset: Maas et al. (2011)
- DistilBERT: Sanh et al. (2019)