sábado, 4 de julio de 2020

Script para seleccionar y trabajr con columnas y filtros de un DataSet

Script para seleccionar y trabajr con columnas y filtros de un DataSet

!/usr/bin/env python3

-- coding: utf-8 --

Created on Sat Jul 4 11:43:13 2020

@author: jeanwolf

In [1]:
### Importando las librerias
In [2]:
import pandas as pd
In [3]:
datos = pd.read_csv("data/atp.csv" , encoding="ISO-8859-1") #aqui se usa condicion encodig
print(datos.head())
   ATP  Location                          Tournament       Date  \
0    1  Adelaide  Australian Hardcourt Championships  3/01/2000   
1    1  Adelaide  Australian Hardcourt Championships  3/01/2000   
2    1  Adelaide  Australian Hardcourt Championships  3/01/2000   
3    1  Adelaide  Australian Hardcourt Championships  3/01/2000   
4    1  Adelaide  Australian Hardcourt Championships  3/01/2000   

          Series    Court Surface      Round  Best of       Winner  ... UBW  \
0  International  Outdoor    Hard  1st Round        3   Dosedel S.  ... NaN   
1  International  Outdoor    Hard  1st Round        3   Enqvist T.  ... NaN   
2  International  Outdoor    Hard  1st Round        3    Escude N.  ... NaN   
3  International  Outdoor    Hard  1st Round        3   Federer R.  ... NaN   
4  International  Outdoor    Hard  1st Round        3  Fromberg R.  ... NaN   

  UBL LBW  LBL  SJW  SJL  MaxW  MaxL  AvgW  AvgL  
0 NaN NaN  NaN  NaN  NaN   NaN   NaN   NaN   NaN  
1 NaN NaN  NaN  NaN  NaN   NaN   NaN   NaN   NaN  
2 NaN NaN  NaN  NaN  NaN   NaN   NaN   NaN   NaN  
3 NaN NaN  NaN  NaN  NaN   NaN   NaN   NaN   NaN  
4 NaN NaN  NaN  NaN  NaN   NaN   NaN   NaN   NaN  

[5 rows x 54 columns]
/home/jeanwolf/anaconda3/lib/python3.7/site-packages/IPython/core/interactiveshell.py:3063: DtypeWarning: Columns (11,12) have mixed types.Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

Aqui generamos un index (indice, ID) en una columna determinada. En este caso Location

In [4]:
datos.set_index("Location", inplace = True)

En la columna seleccionamos solo un grupo de datos. una lista.

In [5]:
dat = datos.loc["Doha"]
dat.to_csv("data/doha.csv")

Aqui combinamos selecciones, por un lado en la columna del index (Location), se selecciona un lugar (Miami) y por otro lado, en la columna Winer toda la columna.

de esa forma se genera un archivo que muestra los ganadores en Miami de la ATP

In [6]:
dat = datos.loc["Miami", "Winner"]
dat.head()
dat.to_csv("data/Winner.csv")

Ejercicio: hacer una tabla con los ganadores en Adelaide, por año y superficie

Aqui generamos una serie de listas pareadas. en este caso una primera por el index, y las otras en funcion de lo que se quiere por columna

In [7]:
print("Ganadores en adelaide por fecha")
dat = datos.loc[["Adelaide"], ["Date", "Winner"]]
print(dat.head(100))
dat.to_csv("data/Winner.csv")
Ganadores en adelaide por fecha
                Date       Winner
Location                         
Adelaide   3/01/2000   Dosedel S.
Adelaide   3/01/2000   Enqvist T.
Adelaide   3/01/2000    Escude N.
Adelaide   3/01/2000   Federer R.
Adelaide   3/01/2000  Fromberg R.
...              ...          ...
Adelaide  30/12/2002    Spadea V.
Adelaide  30/12/2002    Vahaly B.
Adelaide  30/12/2002    Voinea A.
Adelaide  30/12/2002  Zabaleta M.
Adelaide  31/12/2002    Dupuis A.

[100 rows x 2 columns]

Aqui, con mas campos

In [8]:
dat = datos.loc[["Adelaide"], ["Date", "Winner", "Loser"]]
print(dat.head(100))
dat.to_csv("data/Winner.csv")
                Date       Winner           Loser
Location                                         
Adelaide   3/01/2000   Dosedel S.     Ljubicic I.
Adelaide   3/01/2000   Enqvist T.      Clement A.
Adelaide   3/01/2000    Escude N.   Baccanello P.
Adelaide   3/01/2000   Federer R.  Knippschild J.
Adelaide   3/01/2000  Fromberg R.   Woodbridge T.
...              ...          ...             ...
Adelaide  30/12/2002    Spadea V.         Reid T.
Adelaide  30/12/2002    Vahaly B.      Kennedy A.
Adelaide  30/12/2002    Voinea A.       Rochus C.
Adelaide  30/12/2002  Zabaleta M.        Waske A.
Adelaide  31/12/2002    Dupuis A.         Beck K.

[100 rows x 3 columns]

Aqui hacemos selecciones por rango, desde una columna A a la B

In [9]:
dat = datos.loc[["Atlanta", "Melbourne"], "Series" : "Loser"]
print(dat.head(100))
dat.to_csv("data/Winner.csv")
                 Series    Court Surface      Round  Best of        Winner  \
Location                                                                     
Atlanta   International  Outdoor    Clay  1st Round        3     Agassi A.   
Atlanta   International  Outdoor    Clay  1st Round        3     Alonso J.   
Atlanta   International  Outdoor    Clay  1st Round        3      Chang M.   
Atlanta   International  Outdoor    Clay  1st Round        3  Goldstein P.   
Atlanta   International  Outdoor    Clay  1st Round        3       Huet S.   
...                 ...      ...     ...        ...      ...           ...   
Atlanta          ATP250  Outdoor    Hard  1st Round        3       Lu Y.H.   
Atlanta          ATP250  Outdoor    Hard  1st Round        3     Hewitt L.   
Atlanta          ATP250  Outdoor    Hard  1st Round        3   Harrison R.   
Atlanta          ATP250  Outdoor    Hard  1st Round        3      Mahut N.   
Atlanta          ATP250  Outdoor    Hard  1st Round        3        Ram R.   

                       Loser  
Location                      
Atlanta           Malisse X.  
Atlanta               Kim K.  
Atlanta       Sanguinetti D.  
Atlanta   Harper-Griffith L.  
Atlanta           Russell M.  
...                      ...  
Atlanta         Bachinger M.  
Atlanta          Simmonds P.  
Atlanta            Sugita Y.  
Atlanta          Berankis R.  
Atlanta          Dimitrov G.  

[100 rows x 7 columns]

Aqui seleccionamos un grupo de datos que en una columna terminan con un texto especifico.

por ejemplo: Todos los datos de Gran Slam

# (.str.endswith("XX")]) es el comando para seleccionar las entradas que en la columna Series contienen la palabra Salam al final

In [10]:
print ("Solo Gran Slam")
dat = datos.loc[datos["Series"].str.endswith("Slam")]
dat.to_csv("data/Winner.csv")
Solo Gran Slam

Ejercicio: hacer un dataset de los campeonatos International Gold, por ganador y perdedor y por fecha.

In [11]:
dat = datos.loc[datos["Series"].str.endswith("Gold")]
dat.to_csv("data/Winner.csv")
datos = pd.read_csv("data/Winner.csv" , encoding="ISO-8859-1")
datos.set_index("Location", inplace = True)
print(dat.head(100))
dat = datos[["Date", "Winner", "Loser"]]
dat.to_csv("data/Winner.csv")
          ATP       Tournament        Date              Series   Court  \
Location                                                                 
Memphis    10  Kroger St. Jude  14/02/2000  International Gold  Indoor   
Memphis    10  Kroger St. Jude  14/02/2000  International Gold  Indoor   
Memphis    10  Kroger St. Jude  14/02/2000  International Gold  Indoor   
Memphis    10  Kroger St. Jude  14/02/2000  International Gold  Indoor   
Memphis    10  Kroger St. Jude  14/02/2000  International Gold  Indoor   
...       ...              ...         ...                 ...     ...   
London     12          AXA Cup  21/02/2000  International Gold  Indoor   
London     12          AXA Cup  21/02/2000  International Gold  Indoor   
London     12          AXA Cup  21/02/2000  International Gold  Indoor   
London     12          AXA Cup  21/02/2000  International Gold  Indoor   
London     12          AXA Cup  21/02/2000  International Gold  Indoor   

         Surface      Round  Best of         Winner          Loser  ... UBW  \
Location                                                            ...       
Memphis     Hard  1st Round        3        Levy H.     Sekulov J.  ... NaN   
Memphis     Hard  1st Round        3     MacPhie B.       Manta L.  ... NaN   
Memphis     Hard  1st Round        3      Mamiit C.      Jensen L.  ... NaN   
Memphis     Hard  1st Round        3       Massu N.       Etlis G.  ... NaN   
Memphis     Hard  1st Round        3        Popp A.     Behrend T.  ... NaN   
...          ...        ...      ...            ...            ...  ...  ..   
London      Hard  2nd Round        3     Federer R.  Ivanisevic G.  ... NaN   
London      Hard  2nd Round        3  Kafelnikov Y.       Vacek D.  ... NaN   
London      Hard  2nd Round        3     Pioline C.     Tarango J.  ... NaN   
London      Hard  2nd Round        3      Rosset M.      Hrbaty D.  ... NaN   
London      Hard  2nd Round        3    Rusedski G.     Golmard J.  ... NaN   

         UBL  LBW  LBL  SJW  SJL  MaxW  MaxL  AvgW  AvgL  
Location                                                  
Memphis  NaN  NaN  NaN  NaN  NaN   NaN   NaN   NaN   NaN  
Memphis  NaN  NaN  NaN  NaN  NaN   NaN   NaN   NaN   NaN  
Memphis  NaN  NaN  NaN  NaN  NaN   NaN   NaN   NaN   NaN  
Memphis  NaN  NaN  NaN  NaN  NaN   NaN   NaN   NaN   NaN  
Memphis  NaN  NaN  NaN  NaN  NaN   NaN   NaN   NaN   NaN  
...       ..  ...  ...  ...  ...   ...   ...   ...   ...  
London   NaN  NaN  NaN  NaN  NaN   NaN   NaN   NaN   NaN  
London   NaN  NaN  NaN  NaN  NaN   NaN   NaN   NaN   NaN  
London   NaN  NaN  NaN  NaN  NaN   NaN   NaN   NaN   NaN  
London   NaN  NaN  NaN  NaN  NaN   NaN   NaN   NaN   NaN  
London   NaN  NaN  NaN  NaN  NaN   NaN   NaN   NaN   NaN  

[100 rows x 53 columns]

Ejercicio: Generar desde el DataFrame original una tabla con X campos.

Aquí, se usa dobles corchetes para generar una lista

In [12]:
dat = datos[["Tournament", "Date", "Winner", "Loser"]]
dat
Out[12]:
Tournament Date Winner Loser
Location
Memphis Kroger St. Jude 14/02/2000 Levy H. Sekulov J.
Memphis Kroger St. Jude 14/02/2000 MacPhie B. Manta L.
Memphis Kroger St. Jude 14/02/2000 Mamiit C. Jensen L.
Memphis Kroger St. Jude 14/02/2000 Massu N. Etlis G.
Memphis Kroger St. Jude 14/02/2000 Popp A. Behrend T.
... ... ... ... ...
Vienna BA-CA Tennis Trophy 10/10/2008 Lopez F. Melzer J.
Vienna BA-CA Tennis Trophy 10/10/2008 Petzschner P. Moya C.
Vienna BA-CA Tennis Trophy 11/10/2008 Monfils G. Kohlschreiber P.
Vienna BA-CA Tennis Trophy 11/10/2008 Petzschner P. Lopez F.
Vienna BA-CA Tennis Trophy 12/10/2008 Petzschner P. Monfils G.

3755 rows × 4 columns

Share: