GetStart: https://pandas.pydata.org/docs/getting_started/index.html#getting-started
Tutorials: https://pandas.pydata.org/docs/getting_started/tutorials.html#communitytutorials
GetStart
推荐在 venv 虚拟环境中安装。
1 2 3 4
| # pip>=19.3 pip install pandas # 安装所有依赖,包含可选依赖 pip install "pandas[all]"
|
运行测试套件。
1
| pip install "pandas[test]"
|
1 2
| import pandas as pd pd.test()
|
依赖
必要依赖:
可选依赖,有许多只用于特定方法的依赖:
pandas.read.hdf() 需要 pytables 依赖;
DataFrame.to.markdown() 需要 tabulate 依赖;
- 若对应的依赖未安装,pandas会抛出
ImportError 异常;
性能
推荐安装,尤其在处理大数据集合时。
1
| pip install "pandas[performance]"
|
- numexpr:通过使用多核以及智能分块和缓存来加速某些数值运算,以实现较大的速度;
- bottleneck:通过使用专门的cpython协程来加速某些类型的nan,以实现较大的加速;
- numba:接受engine=”numba”的操作的替代执行引擎,使用JIT编译器将Python函数转换为使用LLVM编译器优化的机器码;
可视化
1
| pip install "pandas[plot, output-formatting]"
|
- matplotlib:绘图库;
- Jinja2:使用datafframe .style的条件格式;
- tabulate:打印markdown友好的格式;
计算
1
| pip install "pandas[computation]"
|
- SciPy:各种统计功能;
- xarray:处理N维数据的类pandas api;
Excel
1
| pip install "pandas[excel]"
|
- xlrd:读取excel;
- xlsxwriter:写入excel;
- openpyxl:读写xlsx文件;
- pyxlsb:读取xlsb文件;
- python-calamine:读取xls/xlsx/xlsb/ods文件;
HTML
1
| pip install "pandas[html]"
|
- BeautifulSoup4:HTML parser for read_html
- html5lib:HTML parser for read_html
- lxml:HTML parser for read_html
XML
1
| pip install "pandas[xml]"
|
- lxml:XML parser for read_xml and tree builder for to_xml
SQL
1 2
| # 传统的可安装驱动 pip install "pandas[postgresql, mysql, sql-other]"
|
- SQLAlchemy:SQL support for databases other than sqlite
- psycopg2:PostgreSQL engine for sqlalchemy
- pymysql:MySQL engine for sqlalchemy
- adbc-driver-postgresql:ADBC Driver for PostgreSQL
- adbc-driver-sqlite:ADBC Driver for SQLite
其他数据源
1
| pip install "pandas[hdf5, parquet, feather, spss, excel]"
|
- PyTables:HDF5-based reading / writing
- blosc:Compression for HDF5; only available on
conda
- zlib:Compression for HDF5
- fastparquet:Parquet reading / writing (pyarrow is default)
- pyarrow:Parquet, ORC, and feather reading / writing
- pyreadstat:SPSS files (.sav) reading
- odfpy:Open document format (.odf, .ods, .odt) reading / writing
从Cloud访问数据
1
| python install "pandas[fss, aws, gcp]"
|
- fsspec:Handling files aside from simple local and HTTP (required dependency of s3fs, gcsfs).
- gcsfs:Google Cloud Storage access
- pandas-gbq:Google Big Query access
- s3fs:Amazon S3 access
剪贴板
依赖于操作系统,可能需要安装系统级别的包,如Linux中需要安装 xclip 或 xsel 的CLI工具。
1
| pip install "pandas[clipboard]"
|
压缩
1
| pip install "pandas[compression]"
|
- Zstandard:Zstandard compression
数据类型
表格数据,定义内部类型有:
DataFrame:二维数据结构,column可以存储的数据类型包括 字符、整型、浮点、分类数据等;
Series:DateFrame中的每一column都是一个Series
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
| import pandas as pd
pd.DataFrame( { "Name": ["Aric", "Frank", "Max"], "Age": [22, 18, 30], "Sex": ["male", "male", "female"], } )
df["Age"] df.Age
ages = pd.Series([22, 18, 30], name="Age")
df["Age"].max() df.describe()
|
读写
DataFrame
pandas支持很多不同文件格式或数据源:csv、excel、sql、json、parquet等。
1 2 3 4 5 6 7 8 9 10 11 12
| import pandas as pd
titanic = pd.read_csv("data/titanic.csv") titanic titanic.head(5) titanic.tail(5) titanic.dtypes titanic.info()
titanic.to_excel("titanic.xlsx", sheet_name="passengers", index=False)
|
Series
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
| import pandas as pd
titanic = pd.read_csv("data/titanic.csv") ages = titanic["Age"] ages.head() type(titanic["Age"])
titanic.shape titanic["Age"].shape
titanic[["Age", "Sex"]]
above_35 = titanic[titanic["Age"] > 35] above_35.head()
class_23 = titanic[titanic["Pclass"].isin([2, 3])] class_23.head()
age_no_na = titanic[titanic["Age"].notna()] age_no_na.head()
|
loc VS iloc
选择指定的行和列。
1 2 3 4
| titanic.loc[titanic["Age"] > 35, "Name"]
titanic.iloc[9:25, 2:5]
|
Plot
https://pandas.pydata.org/docs/user_guide/visualization.html#
折线图
1 2 3 4 5 6
| import pandas as pd import matplotlib.pyplot as plt
df.plot() df.plot.line() df.Age.plot.line()
|
散点图
1
| df.plot.scatter(x="Q1", y="Q2", alpha=0.5)
|
箱线图
可表示最大值、最小值、平均值。
1 2 3 4 5 6 7 8
| df.boxplot() df.plot.box() df.plot.box(vert=False, positions=[1, 3, 2, 4])
df = pd.DataFrame(np.random.rand(10, 2), columns=["Col1", "Col2"]) df["X"] = pd.Series(["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"]) plt.figure(); bp = df.boxplot(by="X")
|
面积图
1
| df.plot.area(figsize=(12, 4), subplots=True)
|
statistics
1 2 3 4 5 6
| import pandas as pd
titanic.to_excel("titanic.xlsx", sheet_name="passengers", index=False) titanic["Age"].mean() titanic[["Age", "Fare"]].median() titanic[["Age", "Fare"]].describe()
|