How do I retrieve data from a list element using Beautiful Soup?

Question

Asked: June 12, 2022
Views: 76
Answers: 3
Status: Solved

The code below retrieves the HTML data into a list. I am trying to extract a specific attribute called data-append-csv (for example: data-append-csv="abbotco01") from the HTML of the Baseball Reference page (see the code for the link):

Current code:

from bs4 import BeautifulSoup
from bs4 import Comment
import pandas as pd
import os.path
import requests

r = requests.get("https://www.baseball-reference.com/leagues/majors/2021-standard-batting.shtml")
soup = BeautifulSoup(r.content, "html.parser") # try lxml
[x.extract() for x in soup.find_all(string=lambda text: isinstance(text, Comment)) if 'id="div_players_standard_batting"' in x]

Current environment settings:

dependencies:
  - python=3.9.7
  - beautifulsoup4=4.11.1
  - jupyterlab=3.3.2
  - pandas=1.4.2
  - pyodbc=4.0.32

The end goal: a pandas DataFrame that contains every data-append-csv value from the HTML table.

index  data-append-csv
0      abbotco01
1      abreual01
2      abreubr01
etc.

Asked June 12, 2022 by Clutch_Dude

3 Answers

Answer 2

2 votes

HedgeHog (2934 points)

First parse the extracted comment string with BeautifulSoup, then pick out the elements that carry the attribute with .select('[data-append-csv]'):

table = [x.extract() for x in soup.find_all(string=lambda text: isinstance(text, Comment)) if 'id="div_players_standard_batting"' in x][0]
[(a.find_previous('th').text,a.get('data-append-csv')) for a in BeautifulSoup(table, 'html.parser').select('[data-append-csv]')]
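
The two steps above can be tried offline on a small inline document. This is only a minimal sketch: the HTML snippet is a made-up stand-in for the real page (the markup and the summary row are assumptions), in which the table sits inside an HTML comment, as it does in the string extracted above.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical stand-in for the real page: the table is hidden in a comment.
html = """
<div id="all_players_standard_batting">
<!--
<div id="div_players_standard_batting">
<table>
  <tr><th>1</th><td data-append-csv="abadfe01">Fernando Abad*</td></tr>
  <tr><th>2</th><td data-append-csv="abbotco01">Cory Abbott</td></tr>
  <tr><th></th><td>LgAvg per 600 PA</td></tr>
</table>
</div>
-->
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Step 1: find the comment that contains the table markup.
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
table = next(c for c in comments if 'id="div_players_standard_batting"' in c)

# Step 2: parse the comment text as HTML, then select by attribute presence.
table_soup = BeautifulSoup(table, "html.parser")
pairs = [
    (cell.find_previous("th").text, cell.get("data-append-csv"))
    for cell in table_soup.select("[data-append-csv]")
]
print(pairs)
```

The selector '[data-append-csv]' matches any element that has the attribute, whatever its value, so the summary row without the attribute is skipped.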

To guarantee a correct join with your original data, also grab the rank (Rk) of each row, in case some rows lack the attribute and the two DataFrames end up with different lengths:

(a.find_previous('th').text,a.get('data-append-csv'))

You can now create your DataFrame from the list:

pd.DataFrame([(a.find_previous('th').text,a.get('data-append-csv')) for a in BeautifulSoup(table, 'html.parser').select('[data-append-csv]')],columns=['Rk','data-append-csv'],dtype='object')
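
To see why the rank is worth keeping, here is a small sketch on a hypothetical table fragment (the rows are invented for illustration) in which one row has no data-append-csv attribute. A purely positional index would drift after the gap, whereas the Rk value read from the row itself keeps each ID tied to its row.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical table fragment; the row with Rk 2 has no data-append-csv.
fragment = """
<table>
  <tr><th>1</th><td data-append-csv="abadfe01">Fernando Abad*</td></tr>
  <tr><th>2</th><td>Row without the attribute</td></tr>
  <tr><th>3</th><td data-append-csv="abreual01">Albert Abreu</td></tr>
</table>
"""

cells = BeautifulSoup(fragment, "html.parser").select("[data-append-csv]")

# Pair every ID with the Rk found in the <th> of the same row.
df2 = pd.DataFrame(
    [(c.find_previous("th").text, c.get("data-append-csv")) for c in cells],
    columns=["Rk", "data-append-csv"],
    dtype="object",
)
print(df2)
```

Only two rows come back, but their Rk values are 1 and 3, so a later join on Rk still lines up with the full table.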

Example

Join the data to your initial DataFrame and check the last column:

from io import StringIO

from bs4 import BeautifulSoup
from bs4 import Comment
import pandas as pd
import requests

r = requests.get("https://www.baseball-reference.com/leagues/majors/2021-standard-batting.shtml")
soup = BeautifulSoup(r.text, 'html.parser')
# The batting table sits inside an HTML comment; pull that comment out
table = [x.extract() for x in soup.find_all(string=lambda text: isinstance(text, Comment)) if 'id="div_players_standard_batting"' in x][0]

# Create and clean DataFrame 1: drop repeated header rows and rows without a rank
df1 = pd.read_html(StringIO(str(table)))[0]
df1 = df1[(~df1.Rk.isna()) & (df1.Rk != 'Rk')]
df1.set_index('Rk', inplace=True)

# Create DataFrame 2: one (Rk, data-append-csv) pair per matching element
df2 = pd.DataFrame([(a.find_previous('th').text,a.get('data-append-csv')) for a in BeautifulSoup(table, 'html.parser').select('[data-append-csv]')],columns=['Rk','data-append-csv'],dtype='object')
df2.set_index('Rk', inplace=True)

# Join both DataFrames on Rk
df1.join(df2).reset_index()
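
The cleaning and joining steps can also be checked without any network access. The following sketch uses hand-written DataFrames as hypothetical stand-ins for what pd.read_html and the attribute extraction return; only the filtering and join logic mirrors the example.

```python
import pandas as pd

# Hypothetical stand-in for the parsed table: a repeated header row and a
# summary row without a rank are mixed in with the player rows.
df1 = pd.DataFrame(
    {
        "Rk": ["1", "2", "Rk", "3", None],
        "Name": ["Fernando Abad*", "Cory Abbott", "Name", "Albert Abreu",
                 "LgAvg per 600 PA"],
    }
)
# Keep only real player rows, then index by rank.
df1 = df1[(~df1.Rk.isna()) & (df1.Rk != "Rk")].set_index("Rk")

# IDs keyed by the same rank strings.
df2 = pd.DataFrame(
    {
        "Rk": ["1", "2", "3"],
        "data-append-csv": ["abadfe01", "abbotco01", "abreual01"],
    }
).set_index("Rk")

# join() aligns on the index (Rk), not on row position.
result = df1.join(df2).reset_index()
print(result)
```

Both frames hold Rk as strings here; if one side were numeric, the index values would not match and the joined column would come back empty.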

Output (columns PA through IBB elided):

   Rk            Name  Age   Tm  Lg    G  ...  Pos Summary  data-append-csv
0   1  Fernando Abad*   35  BAL  AL    2  ...            1         abadfe01
1   2     Cory Abbott   25  CHC  NL    8  ...          /1H        abbotco01
2   3    Albert Abreu   25  NYY  AL    3  ...            1        abreual01
3   4     Bryan Abreu   24  HOU  AL    1  ...            1        abreubr01
4   5      José Abreu   34  CHW  AL  152  ...        *3D/5        abreujo02
...

Answered June 12, 2022 by HedgeHog (2934 points)

Answer 3

1 vote

baduker (1223 points)

You should be able to get the table with this:

import requests
from bs4 import BeautifulSoup
from bs4 import Comment

import pandas as pd

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:101.0) Gecko/20100101 Firefox/101.0"
}
url = "https://www.baseball-reference.com/leagues/majors/2021-standard-batting.shtml"

with requests.Session() as s:
    comments = (
        BeautifulSoup(
            s.get(url, headers=headers).text,
            "lxml"
        ).find_all(string=lambda text: isinstance(text, Comment))
    )
    table = pd.concat(
        pd.read_html(
            [c for c in comments if "players_standard_batting" in c][0]
        )
    )
    print(table)
    table.to_csv("batting.csv", index=False)

Output:

        Rk               Name  Age   Tm   Lg  ... HBP SH  SF IBB Pos Summary
0        1     Fernando Abad*   35  BAL   AL  ...   0  0   0   0           1
1        2        Cory Abbott   25  CHC   NL  ...   0  0   0   0         /1H
2        3       Albert Abreu   25  NYY   AL  ...   0  0   0   0           1
3        4        Bryan Abreu   24  HOU   AL  ...   0  0   0   0           1
4        5         José Abreu   34  CHW   AL  ...  22  0  10   3       *3D/5
...    ...                ...  ...  ...  ...  ...  .. ..  ..  ..         ...
1787  1720  Bruce Zimmermann*   26  BAL   AL  ...   0  0   0   0           1
1788  1721  Jordan Zimmermann   35  MIL   NL  ...   0  0   0   0          /1
1789  1722        Tyler Zuber   26  KCR   AL  ...   0  0   0   0           1
1790  1723        Mike Zunino   30  TBR   AL  ...   7  0   1   0         2/H
1791   NaN   LgAvg per 600 PA  NaN  NaN  NaN  ...   7  2   4   2         NaN

[1792 rows x 30 columns]

And the downloaded CSV: [screenshot omitted]

Answered June 12, 2022 by baduker (1223 points)

Answer 4

0 votes

Alberto Hanna (465 points)

You need to take the HTML comment you extracted and parse it with BeautifulSoup, then use a CSS selector to get the td cells that carry the 'data-append-csv' attribute.

import requests
import pandas as pd
from bs4 import Comment, BeautifulSoup

r = requests.get("https://www.baseball-reference.com/leagues/majors/2021-standard-batting.shtml")
soup = BeautifulSoup(r.content, 'html.parser')

table_txt = [x.extract() for x in soup.find_all(string=lambda text: isinstance(text, Comment)) if 'id="div_players_standard_batting"' in x][0]

table_soup = BeautifulSoup(table_txt, 'html.parser')

list_ = [{'index':index, 'data-append-csv':player['data-append-csv']} for index, player in enumerate(table_soup.select('td[data-append-csv]'), start=1)]

df = pd.DataFrame(list_)

Answered June 12, 2022 by Alberto Hanna (465 points)
