House Sales: Data Preprocessing


Tools: Python, Pandas, NumPy, Matplotlib, Seaborn, scikit-learn, NLTK

Final Project

This project implements a complete exploratory data analysis (EDA) workflow on a real-world housing sales dataset containing over 160,000 records. The analysis includes handling missing values, detecting outliers, and converting data types for categorical, numerical, datetime, and text features. Categorical variables were cleaned and filtered using frequency thresholds and cumulative distribution; numerical features were normalized and log-transformed; and textual summaries were tokenized, lemmatized, and vectorized to extract frequent patterns. The project also includes feature engineering (e.g., price per square foot) and correlation analysis to support further modeling or business insights.

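The derived feature and correlation step mentioned above can be sketched as follows. This is a minimal illustration, not the project's actual code; the column names ("Sold Price", "Total interior livable area") and the sample values are assumptions and should be adjusted to the real dataset schema.

```python
import pandas as pd

# Hypothetical mini-dataset standing in for the 160k-row housing data
df = pd.DataFrame({
    "Sold Price": [850_000, 1_200_000, 640_000],
    "Total interior livable area": [1_700, 2_400, 1_600],
})

# Feature engineering: price per square foot
df["price_per_sqft"] = df["Sold Price"] / df["Total interior livable area"]

# Correlation analysis across numeric features
corr = df.corr(numeric_only=True)
print(corr["Sold Price"].sort_values(ascending=False))
```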

Data Processing

Categorical Data

To clean the “High School” column, whitespace was removed and all strings were converted to lowercase to standardize the text format. Irrelevant entries (such as URLs) were filtered by retaining only values ending with “high school.” Semantically redundant categories were manually merged (e.g., “Venice Senior High School” was unified as “Venice High School”). To reduce noise caused by the long tail of rare categories, a cumulative frequency analysis was conducted. Based on the 95% coverage rule, school names appearing fewer than 38 times were removed, reducing the number of categories from over 900 to 457 and improving data quality for subsequent analysis.

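The cleaning steps above can be sketched with Pandas string operations. This is a minimal example on invented values, assuming a column named "High School"; in the real data the 95% cumulative-coverage cutoff corresponded to a count threshold of 38, whereas on this tiny sample it simply keeps the head of the distribution.

```python
import pandas as pd

s = pd.Series([
    "  Venice Senior High School ",
    "venice high school",
    "Beverly Hills High School",
    "https://example.com/listing",   # irrelevant entry to be filtered out
])

# 1) Strip whitespace and lowercase to standardize the text format
s = s.str.strip().str.lower()

# 2) Keep only values ending with "high school" (drops URLs and other noise)
s = s[s.str.endswith("high school")]

# 3) Manually merge semantically redundant categories
s = s.replace({"venice senior high school": "venice high school"})

# 4) Cumulative frequency analysis: keep categories covering 95% of rows
counts = s.value_counts()
coverage = counts.cumsum() / counts.sum()
keep = coverage[coverage <= 0.95].index
s = s[s.isin(keep)]
```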

Numerical Data

To clean the “Lot size” column, the data was first converted to float type to enable numeric operations. Descriptive statistics revealed extreme outliers, with a maximum value far exceeding the typical range. A boxplot and IQR analysis were used to assess the distribution, and an upper threshold of 100,000 was selected to remove implausible values. After filtering, the column was normalized using MinMax scaling to standardize the range between 0 and 1. A log transformation was also applied to further reduce skewness and improve the overall distribution shape for downstream analysis.

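A compact sketch of this pipeline, on invented "Lot size" values. The 100,000 threshold comes from the analysis above; the IQR statistics stand in for the boxplot inspection, and `MinMaxScaler` is scikit-learn's implementation of the scaling step.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

lot = pd.Series(["5000", "7200", "650000", "12000"], name="Lot size")

# 1) Convert to float to enable numeric operations
lot = lot.astype(float)

# 2) IQR statistics (the same numbers a boxplot visualizes)
q1, q3 = lot.quantile([0.25, 0.75])
iqr = q3 - q1

# 3) Remove implausible values above the chosen 100,000 threshold
lot = lot[lot <= 100_000]

# 4) MinMax scaling to standardize the range to [0, 1]
scaled = MinMaxScaler().fit_transform(lot.to_frame())

# 5) Log transform (log1p) to reduce skewness
logged = np.log1p(lot)
```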

Text Data

To process the “Summary” column, missing values were filled with empty strings to ensure compatibility with downstream operations. All summaries were tokenized and lemmatized to reduce word variation and standardize textual input. A custom stopword list was constructed by extending the default English stopword set with domain-specific noise terms such as “ha.” The cleaned corpus was then vectorized using the Bag-of-Words method, with low-frequency words (appearing in fewer than five documents) and high-frequency words (appearing in more than 80% of documents) removed. After vectorization, the top 20 most frequent terms—such as “home,” “room,” and “bedroom”—were extracted to summarize the dominant patterns in the property descriptions.
