Friday, July 17, 2020

Using basic Pandas commands


In [1]:
import pandas as pd
import numpy as np

Generating a DataFrame of random numbers with 4 columns and 120 rows

In [2]:
OP = pd.DataFrame(np.random.rand(120,4))
OP
Out[2]:
0 1 2 3
0 0.531800 0.006758 0.738428 0.501040
1 0.608820 0.321357 0.885696 0.746284
2 0.311094 0.349098 0.019949 0.245917
3 0.856562 0.517401 0.764246 0.061915
4 0.214642 0.607574 0.862968 0.990592
... ... ... ... ...
115 0.030316 0.756523 0.295768 0.858615
116 0.061035 0.004693 0.368267 0.801569
117 0.960026 0.708683 0.852853 0.670227
118 0.155944 0.735773 0.415909 0.027869
119 0.515879 0.138349 0.830983 0.248404
120 rows × 4 columns
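
Because np.random.rand draws new values on every run, the numbers shown above will differ each time the cell is executed. A minimal sketch of a repeatable variant follows, assuming NumPy's Generator API is available; the seed 42 and the name OP_seeded are arbitrary choices for illustration and do not come from the original notebook.

```python
import numpy as np
import pandas as pd

# Assumption: the seed value 42 is arbitrary; any fixed integer makes the run repeatable.
rng = np.random.default_rng(42)

# Same shape as the example above: 120 rows, 4 columns, uniform values in [0, 1).
OP_seeded = pd.DataFrame(rng.random((120, 4)))

print(OP_seeded.shape)
```

Re-running this cell with the same seed should produce the same table, which makes later outputs easier to compare.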

Display the first n rows of the DataFrame above

In [3]:
OP.head(7)
Out[3]:
0 1 2 3
0 0.531800 0.006758 0.738428 0.501040
1 0.608820 0.321357 0.885696 0.746284
2 0.311094 0.349098 0.019949 0.245917
3 0.856562 0.517401 0.764246 0.061915
4 0.214642 0.607574 0.862968 0.990592
5 0.860340 0.872583 0.138736 0.235088
6 0.945173 0.252427 0.392737 0.213455

Show the last n rows

In [4]:
OP.tail(7)
Out[4]:
0 1 2 3
113 0.131177 0.874260 0.831062 0.863078
114 0.789923 0.263941 0.385318 0.919080
115 0.030316 0.756523 0.295768 0.858615
116 0.061035 0.004693 0.368267 0.801569
117 0.960026 0.708683 0.852853 0.670227
118 0.155944 0.735773 0.415909 0.027869
119 0.515879 0.138349 0.830983 0.248404
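
Both head and tail take an optional n. The sketch below illustrates the default (n=5) and a negative n, which returns everything except the last n rows. The frame df and the variable names are hypothetical and only used for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the same shape as the example above.
df = pd.DataFrame(np.random.rand(120, 4))

first_default = df.head()        # n defaults to 5
last_three = df.tail(3)          # rows with index 117, 118, 119
all_but_last_two = df.head(-2)   # negative n: every row except the last 2

print(len(first_default), list(last_three.index), len(all_but_last_two))
```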

Number of rows and columns, and the type of each column

In [5]:
OP.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120 entries, 0 to 119
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       120 non-null    float64
 1   1       120 non-null    float64
 2   2       120 non-null    float64
 3   3       120 non-null    float64
dtypes: float64(4)
memory usage: 3.9 KB
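
info() prints its summary rather than returning it. When the row count, column count, or column types are needed as values, the shape and dtypes attributes can be used instead. A small sketch, using a hypothetical frame df of the same shape:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the same shape as the example above.
df = pd.DataFrame(np.random.rand(120, 4))

n_rows, n_cols = df.shape   # tuple: (number of rows, number of columns)
column_types = df.dtypes    # Series mapping each column to its dtype

print(n_rows, n_cols)
print(column_types)
```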

Column statistics

In this case, for all columns. Note that describe() does not include more advanced statistics, such as the kurtosis coefficient.

In [6]:
OP.describe()
Out[6]:
0 1 2 3
count 120.000000 120.000000 120.000000 120.000000
mean 0.503812 0.578188 0.492851 0.474296
std 0.295052 0.296464 0.285194 0.304249
min 0.004593 0.004693 0.004847 0.000375
25% 0.217390 0.320945 0.265610 0.214420
50% 0.519755 0.646808 0.486191 0.494482
75% 0.753617 0.844045 0.738208 0.752004
max 0.978121 0.988870 0.988131 0.996214
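
Kurtosis and skewness are available through separate DataFrame methods. The sketch below appends them to the describe() table as one possible approach; the frame df and the row labels "kurtosis" and "skew" are choices made for this illustration. pandas reports kurtosis using Fisher's definition, so a normal distribution scores 0.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the same shape as the example above.
df = pd.DataFrame(np.random.rand(120, 4))

kurt = df.kurt()   # per-column kurtosis (Fisher's definition, normal == 0)
skew = df.skew()   # per-column skewness

# Append both as extra rows to the standard describe() summary.
summary = df.describe()
summary.loc["kurtosis"] = kurt
summary.loc["skew"] = skew

print(summary)
```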

Counts of each unique value, for all columns

In [7]:
OP.apply(pd.Series.value_counts)
Out[7]:
0 1 2 3
0.000375 NaN NaN NaN 1.0
0.000713 NaN NaN NaN 1.0
0.004593 1.0 NaN NaN NaN
0.004693 NaN 1.0 NaN NaN
0.004847 NaN NaN 1.0 NaN
... ... ... ... ...
0.988131 NaN NaN 1.0 NaN
0.988870 NaN 1.0 NaN NaN
0.990592 NaN NaN NaN 1.0
0.993591 NaN NaN NaN 1.0
0.996214 NaN NaN NaN 1.0
480 rows × 4 columns
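
The table above is mostly NaN because each random float occurs in only one column, so the other three columns have no count for it. With repeated values the result is easier to read. The sketch below uses a small hypothetical frame (columns x and y are made up for the example) and also shows nunique(), which returns only the number of distinct values per column.

```python
import pandas as pd

# Hypothetical data with repeated values, chosen so the counts are easy to follow.
df = pd.DataFrame({"x": [1, 1, 2, 3], "y": [5, 5, 5, 6]})

counts_x = df["x"].value_counts()             # how often each value appears in column x
counts_all = df.apply(pd.Series.value_counts) # same idea for every column; NaN where a value is absent
unique_per_col = df.nunique()                 # number of distinct values per column

print(counts_all)
print(unique_per_col)
```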

Rename columns

In [8]:
OP.columns = ["a", "b", "c", "d"]
OP
Out[8]:
a b c d
0 0.531800 0.006758 0.738428 0.501040
1 0.608820 0.321357 0.885696 0.746284
2 0.311094 0.349098 0.019949 0.245917
3 0.856562 0.517401 0.764246 0.061915
4 0.214642 0.607574 0.862968 0.990592
... ... ... ... ...
115 0.030316 0.756523 0.295768 0.858615
116 0.061035 0.004693 0.368267 0.801569
117 0.960026 0.708683 0.852853 0.670227
118 0.155944 0.735773 0.415909 0.027869
119 0.515879 0.138349 0.830983 0.248404
120 rows × 4 columns
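
Assigning to columns changes the DataFrame in place and requires one name per column. An alternative sketch uses rename, which takes a mapping and returns a new DataFrame, leaving the original untouched. The frame df here is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical small frame with default integer column labels 0..3.
df = pd.DataFrame(np.random.rand(3, 4))

# rename returns a new DataFrame; columns missing from the mapping keep their labels.
renamed = df.rename(columns={0: "a", 1: "b", 2: "c", 3: "d"})

print(list(renamed.columns))
print(list(df.columns))
```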

More information

https://www.dataquest.io/blog/pandas-cheat-sheet/?utm_source=Dataquest+Blog+Subscribers&utm_campaign=905c386f3f-Blog_Post_2017_02_21_pandas_cheat_sheet&utm_medium=email&utm_term=0_9436fa3dc8-905c386f3f-150782837

When a single scalar value is given together with several index labels, the value is repeated for every label

In [9]:
s = pd.Series(5, index=[0, 1, 2, 3])
s
Out[9]:
0    5
1    5
2    5
3    5
dtype: int64
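
The same behaviour applies to other index types and scalar types. In the sketch below, the labels and the value 2.5 are made up for illustration; the dtype of the resulting Series follows from the scalar.

```python
import pandas as pd

# String labels: the scalar 5 is repeated once per label.
s_labels = pd.Series(5, index=["w", "x", "y", "z"])

# A float scalar gives a float64 Series.
s_float = pd.Series(2.5, index=range(3))

print(s_labels)
print(s_float)
```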