Parsing a script tag with dicts in BeautifulSoup


Question

Working on a partial answer to this question, I came across a bs4.element.Tag that is a mess of nested dicts and lists (s, below).

Is there a way to return a list of the URLs contained in s without using re.findall? Other comments on the structure of this tag would be helpful too.

from bs4 import BeautifulSoup
import requests

link = 'https://stackoverflow.com/jobs?med=site-ui&ref=jobs-tab&sort=p'
r = requests.get(link)
soup = BeautifulSoup(r.text, 'html.parser')

s = soup.find('script', type='application/ld+json')

## the first bit of s:
# s
# Out[116]: 
# <script type="application/ld+json">
# {"@context":"http://schema.org","@type":"ItemList","numberOfItems":50,

What I've tried:

  • randomly perusing through methods with tab completion on s.
  • picking through the docs.

My problem is that s only has 1 attribute (type) and doesn't seem to have any child tags.

Recommended answer

You can use s.text to get the content of the script. It's JSON, so you can then just parse it with json.loads. From there, it's simple dictionary access:

import json

from bs4 import BeautifulSoup
import requests

link = 'https://stackoverflow.com/jobs?med=site-ui&ref=jobs-tab&sort=p'
r = requests.get(link)

soup = BeautifulSoup(r.text, 'html.parser')

s = soup.find('script', type='application/ld+json')

# s.text is the raw JSON string; json.loads turns it into nested dicts and lists
urls = [el['url'] for el in json.loads(s.text)['itemListElement']]

print(urls)
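
If the exact nesting of the parsed JSON is not known in advance, one option is to walk the structure recursively instead of indexing a fixed path. The following is a minimal sketch under stated assumptions: the sample string is a made-up stand-in for s.text (the example.com URLs are placeholders, not real data from the page), and iter_urls is a hypothetical helper name, not part of BeautifulSoup or json.

```python
import json

# Hypothetical stand-in for s.text; the real payload comes from the page.
sample = '''
{"@context": "http://schema.org", "@type": "ItemList", "numberOfItems": 2,
 "itemListElement": [
   {"@type": "ListItem", "position": 1, "url": "https://example.com/jobs/1"},
   {"@type": "ListItem", "position": 2, "url": "https://example.com/jobs/2"}
 ]}
'''


def iter_urls(node, key="url"):
    """Yield every string stored under `key`, at any depth of dicts and lists."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key and isinstance(v, str):
                yield v
            else:
                # Keep descending into nested values under other keys.
                yield from iter_urls(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from iter_urls(item, key)
    # Scalars (str, int, None, ...) carry no nested keys, so they are skipped.


data = json.loads(sample)
urls = list(iter_urls(data))
print(urls)
```

With the real page, json.loads(s.text) would take the place of json.loads(sample). The fixed-path list comprehension above is simpler when the layout is known; the recursive walk trades that simplicity for tolerance of extra nesting.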
