问题描述
我有一个任务来加载一个格式奇怪的文本文件.该文件也包含不需要的数据.它包含两个背靠背的标题,每个标题的数据在交替行上指定.标题行在 ------
之后开始.我需要读取标题及其相应的数据并将其转储到某些 Excel/表格目标中.让我知道如何使用 SSIS 中的任何转换或脚本来解决这个问题.不知道如何为此使用脚本任务.
现在我正在读取一列中的文件,并使用派生列手动尝试使用 substring
函数拆分它.但这仅适用于一个标题,而且它的编码类型太硬.我需要一些动态方法来直接读取标题行和数据行.
输入文件:
A1234-012 I N F O R M A T I C S C O M P A N Y 08/23/17第 2 页 批量 ABC 付款日期 &截止日期 编辑页 481------------------------------------------------------------------------------------------------------------------------------------SEO XRAT CLT LOAN Opening Payment MATURIUH LOAN NEXE ORIG-AMT OFF TO CATE CONTC MON NO.TO TOL NEL S CUP CO IND PATNOM 代码 NOM NOMTER DATE DUO DATE DATE TIME PT # MONEY AQ LOAN NUMBER BLOCK PAYMENT U TYP GH OMG IND1-3 4-6 7-13/90-102 14-19 20-25 26-31 32-34 35-37 38-46 47-48 49 50-51 52-61 62 63 64-72 73 4-576 77 8-80------------------------------------------------------------------------------------------------------------------------------------SEO XRAT CLT LOAN A/C A/C MIN MAX MAX PENDI LATE CCH L/F PARTLYS CUR L/F L/F L/FNOM 代码 NOM 代码系数 MON MON ROAD DAYS MONE POT L/A L/F JAC INT VAD CD 使用 PI VAD DT1-3 4-6 7-13/90-102 14 15 20-23 24-29 30-34 35-37 38-42 43 44 49 60 61-63 64-69使用 ID:------------------------------------------------------------------------------------------------------------------------------------454542 070 567 2136547895 08-08-18 08-06-18 11-02-18 123 256 62,222 LK 5 55 5463218975 5 3 5,535 H025641 055 123 5144511352 B .55321 2.55 6531.22 H #AS454542 070 567 2136547895 08-08-18 08-06-18 11-02-18 123 256 62,222 LK 5 55 5463218975 5 3 5,535 H025641 055 123 5144511352 B .55321 2.55 6531.22 H #AS454542 070 567 2136547895 08-08-18 08-06-18 11-02-18 123 256 62,222 LK 5 55 5463218975 5 3 5,535 H025641 055 123 5144511352 B .55321 2.55 6531.22 H #AS
预期输出应该是:
文件 1:
SEO XRAT CLT LOAN OPENING PAYMENT MATURIUH LOAN NEXE ORIG-AMT OFF TO CATE CONTC MON NO.TO TOL NEL S CUP CO IND PATNOM 代码 NOM NOMTER DATE DUO DATE DATE TIME PT # MONEY AQ LOAN NUMBER BLOCK PAYMENT U TYP GH OMG IND454542 070 567 2136547895 08-08-18 08-06-18 11-02-18 123 256 62,222 LK 5 55 5463218975 5 3 5,535 H454542 070 567 2136547895 08-08-18 08-06-18 11-02-18 123 256 62,222 LK 5 55 5463218975 5 3 5,535 H454542 070 567 2136547895 08-08-18 08-06-18 11-02-18 123 256 62,222 LK 5 55 5463218975 5 3 5,535 H
文件 2:
SEO XRAT CLT LOAN A/C A/C MIN MAX MAX PENDI LATE CCH L/F PARTLYS CUR L/F L/F L/FNOM 代码 NOM 代码系数 MON MON ROAD DAYS MONE POT L/A L/F JAC INT VAD CD 使用 PI VAD DT025641 055 123 5144511352 B .55321 2.55 6531.22 H #AS025641 055 123 5144511352 B .55321 2.55 6531.22 H #AS025641 055 123 5144511352 B .55321 2.55 6531.22 H #AS
忽略前 3 行
要忽略前 3 行,您可以简单地配置平面文件连接管理器以忽略它们,类似于:
拆分文件并删除坏行
1.配置连接管理器
此外,在平面文件连接管理器中,转到高级选项卡并删除除一列之外的所有列并将其数据类型更改为DT_STR
,并将 MaxLength 更改为 4000
.
添加两个连接管理器,一个用于每个目标文件,您必须仅定义一列最大长度 = 4000:
2.配置数据流任务
添加一个数据流任务,并在里面添加一个平面文件源.选择源文件连接管理器.
使用以下表达式添加条件拆分:
文件 1
FINDSTRING([Column 0],OPENING",1) >1 ||FINDSTRING([第 0 列],日期",1) >1 ||TOKENCOUNT([第 0 列]," ") == 19
文件 2
FINDSTRING([Column 0],A/C",1) >1 ||FINDSTRING([Column 0],FACTOR",1) >1 ||TOKENCOUNT([第 0 列]," ") == 10
上面的表达式是根据您在问题中提到的预期输出创建的,我厌倦了在每个标题中搜索唯一关键字并根据空格出现次数拆分数据行.
最后将每个输出映射到目标平面文件组件:
实验
执行结果如下图所示:
更新 1 - 删除重复项
要删除重复项,您必须参考以下链接:
然后在脚本编辑器(C#)中编写以下脚本:
首先确保您添加了 RegularExpressions 命名空间
使用 System.Text.RegularExpressions;
脚本代码
int SEOCount = 0;int NOMCount = 0;Regex regex = new Regex("[ ]{2,}", RegexOptions.None);公共覆盖无效 Input0_ProcessInputRow(Input0Buffer Row){if (Row.Column0.Trim().StartsWith(SEO")){如果(SEOCount == 0){SEOCount++;Row.outFlag = true;}别的{Row.outFlag = false;}}else if (Row.Column0.Trim().StartsWith(NOM")){如果(NOMCount == 0){NOM计数++;Row.outFlag = true;}别的{Row.outFlag = false;}}else if (Row.Column0.Trim().StartsWith(PAGE")){Row.outFlag = false;}别的{Row.outFlag = true;}Row.outColumn0 = regex.Replace(Row.Column0.TrimStart(), " ");}
条件分割
在每个 Script Component 后添加条件拆分,并使用以下表达式过滤重复的标题:
[outFlag] == True
并将条件拆分连接到目的地.确保将
outColumn0
映射到目标列.包链接
- https://www.dropbox.com/s/d936u4xo3mkzns8/Package.dtsx?dl=0
I got a task to load a strangely formatted text file. The file contains unwanted data too. It contains two headers back to back and data for each header is specified on alternate lines. Header rows start after
------
. I need to read both the header along with its corresponding data and dump it into some Excel/table destination using. Let me know how to solve this using any transformation in SSIS or maybe with a script. Don't know how to use a script task for this.Right now I am reading the file in one column and using a derived column manually trying to split it using
substring
function. But that works for only one header and it is too hard coded type. I need some dynamic approach to read header rows as well as data rows directly.Input file:
A1234-012 I N F O R M A T I C S C O M P A N Y 08/23/17 PAGE 2 BATCH ABC PAYMENT DATE & DUE DATE EDIT PAGE 481 ------------------------------------------------------------------------------------------------------------------------------------ SEO XRAT CLT LOAN OPENING PAYMENT MATURIUH LOAN NEXE ORIG-AMT OFF TO CATE CONTC MON NO.TO TOL NEL S CUP CO IND PAT NOM CODE NOM NOMTER DATE DUO DATE DATE TIME PT # MONEY AQ LOAN NUMBER BLOCK PAYMENT U TYP GH OMG IND 1-3 4-6 7-13/90-102 14-19 20-25 26-31 32-34 35-37 38-46 47-48 49 50-51 52-61 62 63 64-72 73 4-5 76 77 8-80 ------------------------------------------------------------------------------------------------------------------------------------ SEO XRAT CLT LOAN A/C A/C MIN MAX MAX PENDI LATE CCH L/F PARTLYS CUR L/F L/F L/F NOM CODE NOM NOMTER CODE FACTOR MON MON ROAD DAYS MONE POT L/A L/F JAC INT VAD CD USED PI VAD DT 1-3 4-6 7-13/90-102 14 15 20-23 24-29 30-34 35-37 38-42 43 44 49 60 61-63 64-69 USED-ID: ------------------------------------------------------------------------------------------------------------------------------------ 454542 070 567 2136547895 08-08-18 08-06-18 11-02-18 123 256 62,222 LK 5 55 5463218975 5 3 5,555.22 33 H55 025641 055 123 5144511352 B .55321 2.55 6531.22 H #AS 454542 070 567 2136547895 08-08-18 08-06-18 11-02-18 123 256 62,222 LK 5 55 5463218975 5 3 5,555.22 33 H55 025641 055 123 5144511352 B .55321 2.55 6531.22 H #AS 454542 070 567 2136547895 08-08-18 08-06-18 11-02-18 123 256 62,222 LK 5 55 5463218975 5 3 5,555.22 33 H55 025641 055 123 5144511352 B .55321 2.55 6531.22 H #AS
Expected output should be:
FILE 1:
SEO XRAT CLT LOAN OPENING PAYMENT MATURIUH LOAN NEXE ORIG-AMT OFF TO CATE CONTC MON NO.TO TOL NEL S CUP CO IND PAT NOM CODE NOM NOMTER DATE DUO DATE DATE TIME PT # MONEY AQ LOAN NUMBER BLOCK PAYMENT U TYP GH OMG IND 454542 070 567 2136547895 08-08-18 08-06-18 11-02-18 123 256 62,222 LK 5 55 5463218975 5 3 5,555.22 33 H55 454542 070 567 2136547895 08-08-18 08-06-18 11-02-18 123 256 62,222 LK 5 55 5463218975 5 3 5,555.22 33 H55 454542 070 567 2136547895 08-08-18 08-06-18 11-02-18 123 256 62,222 LK 5 55 5463218975 5 3 5,555.22 33 H55
FILE 2:
SEO XRAT CLT LOAN A/C A/C MIN MAX MAX PENDI LATE CCH L/F PARTLYS CUR L/F L/F L/F NOM CODE NOM NOMTER CODE FACTOR MON MON ROAD DAYS MONE POT L/A L/F JAC INT VAD CD USED PI VAD DT 025641 055 123 5144511352 B .55321 2.55 6531.22 H #AS 025641 055 123 5144511352 B .55321 2.55 6531.22 H #AS 025641 055 123 5144511352 B .55321 2.55 6531.22 H #AS
解决方案Ignore first 3 rows
To ignore first 3 rows you can simply configure the flat file connection manager to ignore them, similar to:
Split file and remove bad rows
1. Configure connection managers
In addition, in the flat file connection manager, go to the advanced tab and delete all columns except one and change its data type to
DT_STR
and the MaxLength to4000
.Add two connection managers , one for each destination file where you must define only one column with max length = 4000:
2. Configure Data flow task
Add a Data Flow Task, And add a Flat File Source inside. Select the Source File connection manager.
Add a conditional split with the following expressions:
File1
FINDSTRING([Column 0],"OPENING",1) > 1 || FINDSTRING([Column 0],"DATE",1) > 1 || TOKENCOUNT([Column 0]," ") == 19
File2
FINDSTRING([Column 0],"A/C",1) > 1 || FINDSTRING([Column 0],"FACTOR",1) > 1 || TOKENCOUNT([Column 0]," ") == 10
The expressions above are created based on the expected output you mentioned in the question, i tired to search for unique keywords inside each header and splitted the data rows based on the number of space occurrence.
Finally Map each output to a destination flat file component:
Experiments
The execution result is shown in the following screenshots:
Update 1 - Remove duplicates
To remove duplicates you must you can refer to the following link:
- How to remove duplicate rows from flat file using SSIS?
Update 2 - Remove only duplicates headers + Replace spaces with Tab
If you need only to remove duplicate headers then you can do this in two steps:
- Add a script component after each conditional split output to flag unwanted rows
- Add a conditional split to filter rows based on the script component output
In addition, because the columns values does not contains spaces you can use regular expression to replace spaces with single Tab to make the file consistent.
Script Component
In the Script Component add an output column of type DT_BOOL and name it
outFlag
also add a output columnoutColumn0
of typeDT_STR
and length equal to4000
and selectColumn0
as Input Column.Then write the following script in the Script Editor (C#):
First make sure that you add the RegularExpressions namespace
using System.Text.RegularExpressions;
Script Code
int SEOCount = 0; int NOMCount = 0; Regex regex = new Regex("[ ]{2,}", RegexOptions.None); public override void Input0_ProcessInputRow(Input0Buffer Row) { if (Row.Column0.Trim().StartsWith("SEO")) { if (SEOCount == 0) { SEOCount++; Row.outFlag = true; } else { Row.outFlag = false; } } else if (Row.Column0.Trim().StartsWith("NOM")) { if (NOMCount == 0) { NOMCount++; Row.outFlag = true; } else { Row.outFlag = false; } } else if (Row.Column0.Trim().StartsWith("PAGE")) { Row.outFlag = false; } else { Row.outFlag = true; } Row.outColumn0 = regex.Replace(Row.Column0.TrimStart(), " "); }
Conditional Split
Add a conditional split after each Script Component and use the following expression to filter duplicate header:
[outFlag] == True
And connect the conditional split to the destination. Make Sure to map
outColumn0
to the destination column.Package link
- https://www.dropbox.com/s/d936u4xo3mkzns8/Package.dtsx?dl=0
这篇关于如何将带有两个标题的混合记录类型固定宽度文件加载到两个单独的文件中的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持编程学习网!