
House Sales: Data Preprocessing
Tools: Python, Pandas, NumPy, Matplotlib, Seaborn, scikit-learn, NLTK
Final Project
Conducted an exploratory analysis of a real-world real estate dataset (100k+ samples, 1,700+ features) of house sale records in California. The project covered data cleaning, categorical and numerical feature transformation, outlier removal, normalization and log scaling, and custom feature engineering (e.g., price per sqft).
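As an illustration of the custom feature engineering mentioned above, here is a minimal sketch of a price-per-sqft feature. The column names (`sold_price`, `living_sqft`), the toy rows, and the helper `add_price_per_sqft` are hypothetical and not taken from the project's dataset. Non-positive areas are treated as missing so the division cannot produce infinite values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy records; the column names are assumptions, not the real schema.
df = pd.DataFrame({
    "sold_price": [500_000.0, 1_200_000.0, 750_000.0, 300_000.0],
    "living_sqft": [1000.0, 2400.0, 0.0, np.nan],
})

def add_price_per_sqft(frame, price_col="sold_price", area_col="living_sqft"):
    """Return a copy of frame with a derived price_per_sqft column."""
    out = frame.copy()
    # Keep only positive areas; zero or negative areas become NaN.
    area = out[area_col].where(out[area_col] > 0)
    out["price_per_sqft"] = out[price_col] / area
    return out

result = add_price_per_sqft(df)
print(result)
```

Rows with a missing or zero area end up with a missing `price_per_sqft`, which leaves the decision to drop or impute them to a later cleaning step.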
Applied techniques
One-hot encoding for house types and school names
Text vectorization (CountVectorizer + NLTK lemmatization) for listing descriptions
Date feature extraction and correlation heatmaps to identify spatial/socioeconomic trends
Cumulative distribution & IQR filtering to trim long-tail noise
Feature scaling (MinMax, log) for price, size, and distance variables
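Several of the techniques listed above can be sketched together on a toy table: IQR filtering, log and min-max scaling, date feature extraction, and one-hot encoding. This is a sketch under stated assumptions, not the project's actual code. The column names, the sample rows, the 1.5 IQR multiplier, and the helpers `iqr_filter` and `min_max` are hypothetical, and `min_max` is a hand-written stand-in for scikit-learn's `MinMaxScaler`. Text vectorization and heatmaps are left out to keep the example short.

```python
import numpy as np
import pandas as pd

# Hypothetical toy records; the column names are assumptions, not the real schema.
df = pd.DataFrame({
    "house_type": ["Condo", "SingleFamily", "Condo",
                   "Townhouse", "SingleFamily", "Condo"],
    "sold_on": ["2021-01-15", "2021-03-02", "2021-03-20",
                "2021-06-11", "2021-07-04", "2021-09-30"],
    "sold_price": [400_000.0, 650_000.0, 520_000.0,
                   700_000.0, 580_000.0, 25_000_000.0],
})

def iqr_filter(frame, col, k=1.5):
    """Keep rows whose value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = frame[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = frame[col].between(q1 - k * iqr, q3 + k * iqr)
    return frame[mask]

def min_max(series):
    """Rescale a numeric series to [0, 1]; a constant series maps to 0."""
    span = series.max() - series.min()
    return (series - series.min()) / span if span > 0 else series * 0.0

# 1. Trim the long tail (the 25M sale falls outside the IQR fence).
trimmed = iqr_filter(df, "sold_price")

# 2. Log-transform the price, then rescale the result to [0, 1].
trimmed = trimmed.assign(log_price=np.log1p(trimmed["sold_price"]))
trimmed["log_price_scaled"] = min_max(trimmed["log_price"])

# 3. Extract a simple date feature from the sale date.
trimmed["sold_month"] = pd.to_datetime(trimmed["sold_on"]).dt.month

# 4. One-hot encode the categorical house type.
encoded = pd.get_dummies(trimmed, columns=["house_type"], prefix="type")
print(encoded.columns.tolist())
```

Filtering before scaling matters here: if the extreme sale were kept, min-max scaling would compress every other price into a narrow band near zero.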