House Sales: Data Preprocessing

Tools: Python, Pandas, NumPy, Matplotlib, Seaborn, scikit-learn, NLTK

Final Project

Conducted an exploratory analysis of a real-world real estate dataset (100k+ samples, 1700+ features) of house sale records in California. The project covered data cleaning, categorical and numerical feature transformation, outlier removal, normalization and log scaling, and custom feature engineering (e.g., price per sqft).
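
As a rough illustration of the custom feature engineering and log scaling mentioned above, here is a minimal sketch on toy data. The column names (sold_price, living_area_sqft) and the values are assumptions made for the example, not the dataset's actual schema.

```python
import numpy as np
import pandas as pd

# Hypothetical toy records; column names and values are illustrative only.
df = pd.DataFrame({
    "sold_price": [850_000, 1_200_000, 430_000],
    "living_area_sqft": [1_700, 2_400, 0],
})

# Treat a zero area as missing so the ratio becomes NaN instead of infinity.
area = df["living_area_sqft"].replace(0, np.nan)
df["price_per_sqft"] = df["sold_price"] / area

# log1p compresses the long right tail that sale prices typically have.
df["log_price"] = np.log1p(df["sold_price"])
```

Masking zero areas before dividing is one possible choice; dropping or imputing those rows would work as well, depending on how many there are.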

Applied techniques

  • One-hot encoding for house types and school names

  • Text vectorization (CountVectorizer + NLTK lemmatization) for listing descriptions

  • Date feature extraction and correlation heatmaps to identify spatial/socioeconomic trends

  • Cumulative distribution & IQR filtering to trim long-tail noise

  • Feature scaling (MinMax, log) for price, size, and distance variables
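
Several of the techniques listed above (IQR filtering, date feature extraction, one-hot encoding, MinMax scaling) can be sketched together on a toy frame. This is an illustrative example only: the column names, the sample values, the iqr_filter helper, and the conventional 1.5 multiplier are assumptions, not the project's actual code or settings.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy records; names and values are illustrative only.
df = pd.DataFrame({
    "house_type": ["Condo", "SingleFamily", "Condo", "Townhouse", "SingleFamily"],
    "sold_price": [400_000, 650_000, 520_000, 580_000, 9_000_000],
    "sold_on": ["2020-01-15", "2020-03-02", "2020-07-21", "2021-02-10", "2021-05-30"],
})

def iqr_filter(frame, column, k=1.5):
    """Keep rows whose value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = frame[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    return frame[frame[column].between(q1 - k * iqr, q3 + k * iqr)]

# Trim the long tail; copy so later column assignments do not touch df.
clean = iqr_filter(df, "sold_price").copy()

# Date feature extraction: split the sale date into year and month.
clean["sold_on"] = pd.to_datetime(clean["sold_on"])
clean["sold_year"] = clean["sold_on"].dt.year
clean["sold_month"] = clean["sold_on"].dt.month

# One-hot encode the categorical house type.
clean = pd.get_dummies(clean, columns=["house_type"])

# MinMax scaling maps the remaining prices onto [0, 1].
clean["price_scaled"] = MinMaxScaler().fit_transform(clean[["sold_price"]]).ravel()
```

Filtering before scaling matters here: MinMax scaling is sensitive to extremes, so a single very large sale price left in the data would squeeze every other value toward zero.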