Spark SQL and MySQL - SaveMode.Overwrite not inserting modified data

This article explains how to handle the problem of Spark SQL and MySQL SaveMode.Overwrite not inserting modified data; it may be a useful reference if you run into the same issue.

Problem Description

I have a test table in MySQL with id and name like below:

+----+-------+
| id | name  |
+----+-------+
| 1  | Name1 |
+----+-------+
| 2  | Name2 |
+----+-------+
| 3  | Name3 |
+----+-------+

I am using a Spark DataFrame to read this data (via JDBC) and modifying it like this:

Dataset<Row> modified = sparkSession.sql("select id, concat(name,' - new') as name from test");
modified.write().mode("overwrite").jdbc(AppProperties.MYSQL_CONNECTION_URL,
                "test", connectionProperties);

But my problem is that if I use overwrite mode, it drops the previous table and creates a new table, but it does not insert any data.

I tried the same program by reading from a CSV file (with the same data as the test table) and overwriting. That worked for me.

Am I missing something here?

Thank You!

Solution

The problem is in your code. Because you overwrite a table that you are also trying to read from, you effectively obliterate all of the data before Spark can actually access it.

Remember that Spark is lazy. When you create a Dataset, Spark fetches the required metadata but does not load the data, so there is no magic cache that preserves the original content. The data is loaded only when it is actually required, which here means when you execute the write action; by the time writing starts, there is no data left to fetch.
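To make that laziness concrete, here is a minimal sketch in the same Java API as the question. It reuses the sparkSession, AppProperties.MYSQL_CONNECTION_URL and connectionProperties from the question; the JDBC read itself is assumed rather than copied from the original post:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Creating the Dataset only fetches the schema from MySQL; no rows are read yet.
Dataset<Row> df = sparkSession.read()
        .jdbc(AppProperties.MYSQL_CONNECTION_URL, "test", connectionProperties);

// Rows are actually pulled from MySQL only when an action runs, for example here:
df.show();

// If "test" has already been dropped by an overwrite at this point, there is nothing left to read.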

What you need is something like this:

  • Create a Dataset.
  • Apply required transformations and write the data to an intermediate MySQL table.
  • TRUNCATE the original input and INSERT INTO ... SELECT from the intermediate table, or DROP the original table and RENAME the intermediate table (a minimal sketch of the first variant follows below).
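For the TRUNCATE plus INSERT INTO ... SELECT variant, a sketch could look like the following. It reuses sparkSession, AppProperties.MYSQL_CONNECTION_URL and connectionProperties from the question; the staging table name test_staging and the plain-JDBC swap step are my own illustrative choices, not something taken from the original answer:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

// 1. Create the Dataset and apply the transformation.
Dataset<Row> modified = sparkSession.sql(
        "select id, concat(name, ' - new') as name from test");

// 2. Write the result to an intermediate MySQL table instead of overwriting "test" directly.
modified.write()
        .mode(SaveMode.Overwrite)
        .jdbc(AppProperties.MYSQL_CONNECTION_URL, "test_staging", connectionProperties);

// 3. Swap the data on the MySQL side with plain JDBC
//    (assumes connectionProperties carries the user and password).
try (Connection conn = DriverManager.getConnection(
             AppProperties.MYSQL_CONNECTION_URL, connectionProperties);
     Statement stmt = conn.createStatement()) {
    stmt.executeUpdate("TRUNCATE TABLE test");
    stmt.executeUpdate("INSERT INTO test SELECT * FROM test_staging");
    stmt.executeUpdate("DROP TABLE test_staging");
}

The DROP-and-RENAME variant works the same way: drop the original test table and issue RENAME TABLE test_staging TO test instead of the TRUNCATE and INSERT statements.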

An alternative, but less favorable, approach would be:

  • Create a Dataset.
  • Apply required transformations and write data to a persistent Spark table (df.write.saveAsTable(...) or equivalent)
  • TRUNCATE the original input.
  • Read data back and save (spark.table(...).write.jdbc(...))
  • Drop the Spark table (a sketch of these steps follows below).
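A rough sketch of this alternative, under the same assumptions as the previous example (the Spark table name test_backup is a placeholder of my choosing):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import org.apache.spark.sql.SaveMode;

// 1.-2. Transform and park the result in a Spark-managed table (stored in the warehouse, not in MySQL).
sparkSession.sql("select id, concat(name, ' - new') as name from test")
        .write().mode(SaveMode.Overwrite).saveAsTable("test_backup");

// 3. TRUNCATE the original MySQL table with plain JDBC.
try (Connection conn = DriverManager.getConnection(
             AppProperties.MYSQL_CONNECTION_URL, connectionProperties);
     Statement stmt = conn.createStatement()) {
    stmt.executeUpdate("TRUNCATE TABLE test");
}

// 4. Read the data back from the Spark table and append it into the now-empty MySQL table.
sparkSession.table("test_backup")
        .write()
        .mode(SaveMode.Append)
        .jdbc(AppProperties.MYSQL_CONNECTION_URL, "test", connectionProperties);

// 5. Drop the temporary Spark table.
sparkSession.sql("DROP TABLE test_backup");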

We cannot stress enough that using Spark cache / persist is not the way to go. Even with a conservative StorageLevel (MEMORY_AND_DISK_2 / MEMORY_AND_DISK_SER_2), cached data can be lost (node failures), leading to silent correctness errors.

This concludes the article on Spark SQL and MySQL - SaveMode.Overwrite not inserting modified data. We hope the recommended answer is helpful to you.
