A Custom Paddle Dataset for WildReceipt

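Copyright of this article belongs to the author. Reposting is welcome, but unless the author agrees otherwise, this notice must be kept and a link to the original article must appear in a prominent place on the page; otherwise the author reserves the right to pursue legal liability.

Reposted from 夜明的孤行灯

Article link: https://www.huangyunkun.com/2023/11/23/wildreceipt-paddle-dataset/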



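A Paddle Dataset is the data-source abstraction of the Paddle ecosystem. A subclass must provide at least two methods:

def __getitem__(self, idx):
    # return the sample stored at position idx
    ...

def __len__(self):
    # return the total number of samples
    ...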

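Once your own data is wrapped as a Dataset, it can be used with the high-level API and it benefits from the rest of the Paddle ecosystem, batched loading for example.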
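The two-method contract can be sketched with a toy dataset. `SquaresDataset` is a hypothetical name used only for illustration, and the `ImportError` fallback is an assumption added so the sketch also runs on a machine without Paddle installed; in real code the class would simply subclass `paddle.io.Dataset`.

```python
try:
    from paddle.io import Dataset  # the real base class
except ImportError:
    # Assumption for this sketch only: fall back to a plain object
    # so the example still runs when Paddle is not installed.
    Dataset = object


class SquaresDataset(Dataset):
    """Hypothetical toy dataset: sample i is the pair (i, i * i)."""

    def __init__(self, size):
        super().__init__()
        self.size = size

    def __getitem__(self, idx):
        # Raising IndexError lets plain Python iteration stop cleanly.
        if not 0 <= idx < self.size:
            raise IndexError(idx)
        return idx, idx * idx

    def __len__(self):
        return self.size


dataset = SquaresDataset(4)
samples = [dataset[i] for i in range(len(dataset))]
```

With Paddle installed, an object like this should be usable with `paddle.io.DataLoader(dataset, batch_size=2)` to get batched loading.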

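Paddle ships with wrappers for some common datasets, such as MNIST. These wrappers also download the data automatically, which makes them very convenient for beginners.

WildReceipt is a benchmark dataset for key information extraction from text, and it compares favorably with other public datasets in both size and structure. It is mainly used to train key information extraction on documents. This article shows how to build a Dataset for it.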

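Dataset structure

The first step in building a dataset is to understand it: how the data is laid out and what output is needed. The required output may depend on the specific model.

This article uses WildReceipt + SDMGR for the demonstration.
WildReceipt has two main parts: the images themselves and the region annotations. The annotations are stored in txt files, and the images live under the image_files directory (as the paths in the samples below show). Since we are working in Paddle, we use this archive directly: https://paddleocr.bj.bcebos.com/ppstructure/dataset/wildreceipt.tar

Below are two sample records from the data: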

image_files/Image_12/10/845be0dd6f5b04866a2042abd28d558032ef2576.jpeg	[{"label": "Store_name_value", "transcription": "CHOEUN", "points": [[114.0, 19.0], [230.0, 19.0], [230.0, 1.0], [114.0, 1.0]]}, {"label": "Store_name_value", "transcription": "KOREANRESTAURANT", "points": [[97.0, 35.0], [236.0, 35.0], [236.0, 19.0], [97.0, 19.0]]}, {"label": "Store_addr_value", "transcription": "2621ORANGETHORPEAVE,FULLERTON.", "points": [[29.0, 56.0], [295.0, 56.0], [295.0, 34.0], [29.0, 34.0]]}, {"label": "Tel_value", "transcription": "(714)879-3574", "points": [[48.0, 73.0], [280.0, 73.0], [280.0, 54.0], [48.0, 54.0]]}, {"label": "Others", "transcription": "THANKYOU!!", "points": [[79.0, 92.0], [259.0, 92.0], [259.0, 74.0], [79.0, 74.0]]}, {"label": "Date_key", "transcription": "DATE", "points": [[22.0, 130.0], [61.0, 130.0], [61.0, 112.0], [22.0, 112.0]]}, {"label": "Date_value", "transcription": "12/30/2016FRI", "points": [[70.0, 131.0], [192.0, 131.0], [192.0, 112.0], [70.0, 112.0]]}, {"label": "Time_value", "transcription": "19:19", "points": [[263.0, 128.0], [307.0, 128.0], [307.0, 111.0], [263.0, 111.0]]}, {"label": "Prod_item_value", "transcription": "BIBIM.OCTOPUT1", "points": [[19.0, 168.0], [157.0, 168.0], [157.0, 149.0], [19.0, 149.0]]}, {"label": "Prod_item_value", "transcription": "S-FOODP.CAKT1", "points": [[17.0, 190.0], [158.0, 190.0], [158.0, 171.0], [17.0, 171.0]]}, {"label": "Prod_item_value", "transcription": "PORKDUMPLINT1", "points": [[14.0, 214.0], [158.0, 214.0], [158.0, 192.0], [14.0, 192.0]]}, {"label": "Prod_item_value", "transcription": "LABEEFRIBT1", "points": [[14.0, 236.0], [151.0, 236.0], [151.0, 215.0], [14.0, 215.0]]}, {"label": "Prod_price_value", "transcription": "$13.99", "points": [[254.0, 168.0], [312.0, 168.0], [312.0, 149.0], [254.0, 149.0]]}, {"label": "Prod_price_value", "transcription": "$14.99", "points": [[257.0, 189.0], [314.0, 189.0], [314.0, 170.0], [257.0, 170.0]]}, {"label": "Prod_price_value", "transcription": "$8.99", 
"points": [[268.0, 212.0], [316.0, 212.0], [316.0, 191.0], [268.0, 191.0]]}, {"label": "Prod_price_value", "transcription": "¥17.99", "points": [[261.0, 234.0], [318.0, 234.0], [318.0, 213.0], [261.0, 213.0]]}, {"label": "Prod_item_key", "transcription": "4.00xITEMS", "points": [[118.0, 260.0], [217.0, 260.0], [217.0, 239.0], [118.0, 239.0]]}, {"label": "Subtotal_key", "transcription": "SUBTOTAL", "points": [[8.0, 285.0], [91.0, 285.0], [91.0, 264.0], [8.0, 264.0]]}, {"label": "Tax_key", "transcription": "TAX1", "points": [[8.0, 312.0], [49.0, 312.0], [49.0, 291.0], [8.0, 291.0]]}, {"label": "Total_key", "transcription": "TOTAL", "points": [[8.0, 336.0], [61.0, 336.0], [61.0, 316.0], [8.0, 316.0]]}, {"label": "Subtotal_value", "transcription": "$55.96", "points": [[263.0, 283.0], [325.0, 283.0], [325.0, 260.0], [263.0, 260.0]]}, {"label": "Tax_value", "transcription": "$4.48", "points": [[274.0, 308.0], [326.0, 308.0], [326.0, 286.0], [274.0, 286.0]]}, {"label": "Total_value", "transcription": "$60.44", "points": [[267.0, 334.0], [328.0, 334.0], [328.0, 310.0], [267.0, 310.0]]}, {"label": "Ignore", "transcription": "", "points": [[269.0, 347.0], [328.0, 347.0], [328.0, 336.0], [269.0, 336.0]]}, {"label": "Ignore", "transcription": "", "points": [[11.0, 347.0], [50.0, 347.0], [50.0, 342.0], [11.0, 342.0]]}, {"label": "Time_key", "transcription": "TIME", "points": [[215.0, 128.0], [253.0, 128.0], [253.0, 112.0], [215.0, 112.0]]}]
image_files/Image_83/7/f6b397503d69287709ba3872c7e548d45917cd2e.jpeg	[{"label": "Store_name_value", "transcription": "ILIO'S", "points": [[372.0, 242.0], [479.0, 242.0], [479.0, 178.0], [372.0, 178.0]]}, {"label": "Store_name_value", "transcription": "Restaurant", "points": [[338.0, 282.0], [508.0, 282.0], [508.0, 247.0], [338.0, 247.0]]}, {"label": "Store_addr_value", "transcription": "BretonischerRing7", "points": [[285.0, 324.0], [611.0, 324.0], [611.0, 289.0], [285.0, 289.0]]}, {"label": "Store_addr_value", "transcription": "85630Grasbrunn", "points": [[319.0, 367.0], [581.0, 367.0], [581.0, 332.0], [319.0, 332.0]]}, {"label": "Tel_key", "transcription": "TEL:", "points": [[304.0, 409.0], [368.0, 409.0], [368.0, 374.0], [304.0, 374.0]]}, {"label": "Others", "transcription": "Steuer-Nr.:514/78510", "points": [[65.0, 499.0], [442.0, 499.0], [442.0, 462.0], [65.0, 462.0]]}, {"label": "Others", "transcription": "RechnungNr.2844", "points": [[64.0, 623.0], [372.0, 623.0], [372.0, 552.0], [64.0, 552.0]]}, {"label": "Date_key", "transcription": "Datum:", "points": [[64.0, 656.0], [171.0, 656.0], [171.0, 624.0], [64.0, 624.0]]}, {"label": "Date_value", "transcription": "10.07.13", "points": [[197.0, 654.0], [335.0, 654.0], [335.0, 623.0], [197.0, 623.0]]}, {"label": "Time_value", "transcription": "21:52", "points": [[353.0, 653.0], [442.0, 653.0], [442.0, 623.0], [353.0, 623.0]]}, {"label": "Others", "transcription": "Tisch:102/--", "points": [[549.0, 652.0], [792.0, 652.0], [792.0, 617.0], [549.0, 617.0]]}, {"label": "Prod_quantity_value", "transcription": "4x", "points": [[120.0, 742.0], [155.0, 742.0], [155.0, 713.0], [120.0, 713.0]]}, {"label": "Prod_quantity_value", "transcription": "8x", "points": [[118.0, 788.0], [154.0, 788.0], [154.0, 756.0], [118.0, 756.0]]}, {"label": "Prod_quantity_value", "transcription": "3x", "points": [[118.0, 831.0], [151.0, 831.0], [151.0, 801.0], [118.0, 801.0]]}, {"label": "Prod_quantity_value", "transcription": "1x", "points": 
[[119.0, 876.0], [152.0, 876.0], [152.0, 845.0], [119.0, 845.0]]}, {"label": "Prod_quantity_value", "transcription": "1x", "points": [[119.0, 923.0], [152.0, 923.0], [152.0, 890.0], [119.0, 890.0]]}, {"label": "Prod_quantity_value", "transcription": "1x", "points": [[119.0, 967.0], [152.0, 967.0], [152.0, 936.0], [119.0, 936.0]]}, {"label": "Prod_quantity_value", "transcription": "1x", "points": [[118.0, 1012.0], [151.0, 1012.0], [151.0, 981.0], [118.0, 981.0]]}, {"label": "Prod_quantity_value", "transcription": "1x", "points": [[118.0, 1058.0], [149.0, 1058.0], [149.0, 1027.0], [118.0, 1027.0]]}, {"label": "Prod_quantity_value", "transcription": "2x", "points": [[112.0, 1104.0], [147.0, 1104.0], [147.0, 1070.0], [112.0, 1070.0]]}, {"label": "Prod_quantity_value", "transcription": "1x", "points": [[115.0, 1151.0], [145.0, 1151.0], [145.0, 1119.0], [115.0, 1119.0]]}, {"label": "Prod_quantity_value", "transcription": "1x", "points": [[115.0, 1198.0], [146.0, 1198.0], [146.0, 1164.0], [115.0, 1164.0]]}, {"label": "Prod_quantity_value", "transcription": "1x", "points": [[114.0, 1243.0], [147.0, 1243.0], [147.0, 1210.0], [114.0, 1210.0]]}, {"label": "Prod_quantity_value", "transcription": "1x", "points": [[113.0, 1288.0], [147.0, 1288.0], [147.0, 1253.0], [113.0, 1253.0]]}, {"label": "Prod_quantity_value", "transcription": "1x", "points": [[113.0, 1331.0], [145.0, 1331.0], [145.0, 1299.0], [113.0, 1299.0]]}, {"label": "Prod_item_value", "transcription": "Tee", "points": [[165.0, 1332.0], [218.0, 1332.0], [218.0, 1298.0], [165.0, 1298.0]]}, {"label": "Prod_item_value", "transcription": "Stifado", "points": [[165.0, 1285.0], [292.0, 1285.0], [292.0, 1251.0], [165.0, 1251.0]]}, {"label": "Prod_item_value", "transcription": "SchweinefiletMeta", "points": [[165.0, 1238.0], [493.0, 1238.0], [493.0, 1204.0], [165.0, 1204.0]]}, {"label": "Prod_item_value", "transcription": "BiftekiMetaxa", "points": [[165.0, 1193.0], [419.0, 1193.0], [419.0, 1159.0], [165.0, 1159.0]]}, 
{"label": "Ignore", "transcription": "", "points": [[165.0, 1152.0], [440.0, 1152.0], [440.0, 1111.0], [165.0, 1111.0]]}, {"label": "Prod_item_value", "transcription": "GyrosFolie", "points": [[167.0, 1107.0], [370.0, 1107.0], [370.0, 1066.0], [167.0, 1066.0]]}, {"label": "Prod_item_value", "transcription": "BabyKalamariGefu", "points": [[167.0, 1062.0], [495.0, 1062.0], [495.0, 1019.0], [167.0, 1019.0]]}, {"label": "Prod_item_value", "transcription": "Gyros", "points": [[168.0, 1016.0], [260.0, 1016.0], [260.0, 979.0], [168.0, 979.0]]}, {"label": "Prod_item_value", "transcription": "VegetarischeVaria", "points": [[171.0, 971.0], [493.0, 971.0], [493.0, 931.0], [171.0, 931.0]]}, {"label": "Prod_item_value", "transcription": "GrossesWasser", "points": [[169.0, 922.0], [422.0, 922.0], [422.0, 889.0], [169.0, 889.0]]}, {"label": "Prod_item_value", "transcription": "Saft0,25", "points": [[171.0, 877.0], [336.0, 877.0], [336.0, 841.0], [171.0, 841.0]]}, {"label": "Prod_item_value", "transcription": "Hefe-Weissbier", "points": [[171.0, 833.0], [422.0, 833.0], [422.0, 795.0], [171.0, 795.0]]}, {"label": "Prod_item_value", "transcription": "Weissbierdunkel", "points": [[172.0, 788.0], [455.0, 788.0], [455.0, 750.0], [172.0, 750.0]]}, {"label": "Prod_item_value", "transcription": "LowenbrauOriginal", "points": [[173.0, 742.0], [490.0, 742.0], [490.0, 708.0], [173.0, 708.0]]}, {"label": "Others", "transcription": "a", "points": [[511.0, 738.0], [527.0, 738.0], [527.0, 713.0], [511.0, 713.0]]}, {"label": "Others", "transcription": "a", "points": [[512.0, 782.0], [527.0, 782.0], [527.0, 758.0], [512.0, 758.0]]}, {"label": "Others", "transcription": "a", "points": [[512.0, 826.0], [529.0, 826.0], [529.0, 804.0], [512.0, 804.0]]}, {"label": "Others", "transcription": "a", "points": [[511.0, 1098.0], [527.0, 1098.0], [527.0, 1073.0], [511.0, 1073.0]]}, {"label": "Others", "transcription": "9,90", "points": [[564.0, 1101.0], [635.0, 1101.0], [635.0, 1066.0], [564.0, 1066.0]]}, 
{"label": "Others", "transcription": "3,30", "points": [[564.0, 829.0], [632.0, 829.0], [632.0, 795.0], [564.0, 795.0]]}, {"label": "Others", "transcription": "3,30", "points": [[564.0, 785.0], [633.0, 785.0], [633.0, 751.0], [564.0, 751.0]]}, {"label": "Others", "transcription": "3,00", "points": [[566.0, 743.0], [635.0, 743.0], [635.0, 707.0], [566.0, 707.0]]}, {"label": "Prod_price_value", "transcription": "12,00", "points": [[691.0, 742.0], [776.0, 742.0], [776.0, 706.0], [691.0, 706.0]]}, {"label": "Prod_price_value", "transcription": "26,40", "points": [[687.0, 786.0], [776.0, 786.0], [776.0, 750.0], [687.0, 750.0]]}, {"label": "Prod_price_value", "transcription": "9,90", "points": [[706.0, 830.0], [778.0, 830.0], [778.0, 795.0], [706.0, 795.0]]}, {"label": "Prod_price_value", "transcription": "2,50", "points": [[706.0, 873.0], [779.0, 873.0], [779.0, 840.0], [706.0, 840.0]]}, {"label": "Prod_price_value", "transcription": "2,40", "points": [[706.0, 922.0], [780.0, 922.0], [780.0, 885.0], [706.0, 885.0]]}, {"label": "Prod_price_value", "transcription": "9,90", "points": [[707.0, 967.0], [778.0, 967.0], [778.0, 931.0], [707.0, 931.0]]}, {"label": "Prod_price_value", "transcription": "8,90", "points": [[706.0, 1014.0], [780.0, 1014.0], [780.0, 976.0], [706.0, 976.0]]}, {"label": "Prod_price_value", "transcription": "12,90", "points": [[693.0, 1059.0], [780.0, 1059.0], [780.0, 1022.0], [693.0, 1022.0]]}, {"label": "Prod_price_value", "transcription": "19,80", "points": [[694.0, 1105.0], [781.0, 1105.0], [781.0, 1069.0], [694.0, 1069.0]]}, {"label": "Prod_price_value", "transcription": "6,90", "points": [[708.0, 1150.0], [782.0, 1150.0], [782.0, 1114.0], [708.0, 1114.0]]}, {"label": "Prod_price_value", "transcription": "11,90", "points": [[696.0, 1196.0], [783.0, 1196.0], [783.0, 1160.0], [696.0, 1160.0]]}, {"label": "Prod_price_value", "transcription": "13,90", "points": [[697.0, 1242.0], [784.0, 1242.0], [784.0, 1206.0], [697.0, 1206.0]]}, {"label": 
"Prod_price_value", "transcription": "14,90", "points": [[696.0, 1289.0], [785.0, 1289.0], [785.0, 1253.0], [696.0, 1253.0]]}, {"label": "Prod_price_value", "transcription": "2,10", "points": [[711.0, 1336.0], [784.0, 1336.0], [784.0, 1299.0], [711.0, 1299.0]]}, {"label": "Others", "transcription": "1", "points": [[807.0, 1333.0], [818.0, 1333.0], [818.0, 1301.0], [807.0, 1301.0]]}, {"label": "Others", "transcription": "1", "points": [[807.0, 1287.0], [818.0, 1287.0], [818.0, 1254.0], [807.0, 1254.0]]}, {"label": "Others", "transcription": "1", "points": [[805.0, 1241.0], [817.0, 1241.0], [817.0, 1210.0], [805.0, 1210.0]]}, {"label": "Others", "transcription": "1", "points": [[804.0, 1195.0], [816.0, 1195.0], [816.0, 1163.0], [804.0, 1163.0]]}, {"label": "Others", "transcription": "1", "points": [[804.0, 1150.0], [816.0, 1150.0], [816.0, 1118.0], [804.0, 1118.0]]}, {"label": "Others", "transcription": "1", "points": [[805.0, 1104.0], [814.0, 1104.0], [814.0, 1073.0], [805.0, 1073.0]]}, {"label": "Others", "transcription": "1", "points": [[804.0, 1055.0], [814.0, 1055.0], [814.0, 1024.0], [804.0, 1024.0]]}, {"label": "Others", "transcription": "1", "points": [[802.0, 1011.0], [814.0, 1011.0], [814.0, 979.0], [802.0, 979.0]]}, {"label": "Others", "transcription": "1", "points": [[801.0, 964.0], [813.0, 964.0], [813.0, 932.0], [801.0, 932.0]]}, {"label": "Others", "transcription": "1", "points": [[800.0, 918.0], [812.0, 918.0], [812.0, 887.0], [800.0, 887.0]]}, {"label": "Others", "transcription": "1", "points": [[801.0, 872.0], [812.0, 872.0], [812.0, 840.0], [801.0, 840.0]]}, {"label": "Others", "transcription": "1", "points": [[801.0, 827.0], [811.0, 827.0], [811.0, 795.0], [801.0, 795.0]]}, {"label": "Others", "transcription": "1", "points": [[799.0, 781.0], [811.0, 781.0], [811.0, 749.0], [799.0, 749.0]]}, {"label": "Others", "transcription": "1", "points": [[798.0, 737.0], [809.0, 737.0], [809.0, 705.0], [798.0, 705.0]]}, {"label": "Subtotal_key", 
"transcription": "Netto(1)", "points": [[53.0, 1428.0], [200.0, 1428.0], [200.0, 1388.0], [53.0, 1388.0]]}, {"label": "Tax_key", "transcription": "+19,0%MwSt:", "points": [[52.0, 1474.0], [303.0, 1474.0], [303.0, 1436.0], [52.0, 1436.0]]}, {"label": "Subtotal_value", "transcription": "Eur129,75", "points": [[252.0, 1427.0], [473.0, 1427.0], [473.0, 1390.0], [252.0, 1390.0]]}, {"label": "Tax_value", "transcription": "24,65", "points": [[380.0, 1478.0], [471.0, 1478.0], [471.0, 1438.0], [380.0, 1438.0]]}, {"label": "Total_key", "transcription": "Summe:", "points": [[40.0, 1603.0], [148.0, 1603.0], [148.0, 1535.0], [40.0, 1535.0]]}, {"label": "Others", "transcription": "EsbedienteSie:George", "points": [[37.0, 1654.0], [469.0, 1654.0], [469.0, 1611.0], [37.0, 1611.0]]}, {"label": "Total_value", "transcription": "Eur154,40", "points": [[565.0, 1617.0], [788.0, 1617.0], [788.0, 1537.0], [565.0, 1537.0]]}, {"label": "Tel_value", "transcription": "089-46169340", "points": [[386.0, 409.0], [599.0, 409.0], [599.0, 373.0], [386.0, 373.0]]}]

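Each line is one record. The first part of the line is the image path and the second part, separated by a tab, is the annotation as a JSON array. Each JSON entry has three fields: the label, the text content (transcription), and the region (points).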
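A minimal sketch of reading one such record with the standard library is shown here. The sample line is a shortened copy of the first record above (two of its regions); `parse_line` and `bounding_box` are hypothetical helper names, and reducing the four corner points to an axis-aligned box is an illustrative choice, not something the dataset requires.

```python
import json

# Shortened copy of the first sample record (two of its regions).
line = (
    "image_files/Image_12/10/845be0dd6f5b04866a2042abd28d558032ef2576.jpeg\t"
    '[{"label": "Store_name_value", "transcription": "CHOEUN", '
    '"points": [[114.0, 19.0], [230.0, 19.0], [230.0, 1.0], [114.0, 1.0]]}, '
    '{"label": "Date_key", "transcription": "DATE", '
    '"points": [[22.0, 130.0], [61.0, 130.0], [61.0, 112.0], [22.0, 112.0]]}]'
)


def parse_line(line):
    """Split a record into the image path and the list of region dicts."""
    image_path, annotation_json = line.rstrip("\n").split("\t", 1)
    return image_path, json.loads(annotation_json)


def bounding_box(points):
    """Reduce the four corner points to (x_min, y_min, x_max, y_max)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


image_path, regions = parse_line(line)
labels = [region["label"] for region in regions]
boxes = [bounding_box(region["points"]) for region in regions]
```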

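The label has to be used together with a class_list, and the text content additionally needs a matching character dictionary. The listing below first shows an excerpt of the dictionary (punctuation and currency symbols) and then the class list.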

/
\
.
$
£
€
¥
:
-
,
*
#
(
)
%
@
!
'
&
=
>
+
"

Ignore
Store_name_value
Store_name_key
Store_addr_value
Store_addr_key
Tel_value
Tel_key
Date_value
Date_key
Time_value
Time_key
Prod_item_value
Prod_item_key
Prod_quantity_value
Prod_quantity_key
Prod_price_value
Prod_price_key
Subtotal_value
Subtotal_key
Tax_value
Tax_key
Tips_value
Tips_key
Total_value
Total_key
Others
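One way a class list like this is typically used is to turn label names into integer ids. The sketch below assumes that a label's id is simply its position in the list (so Ignore is 0 and Others is 25); that convention is an assumption here, and `LABEL_TO_ID` and `encode_labels` are hypothetical names.

```python
# Class names copied from the list above, in the same order.
CLASS_LIST = [
    "Ignore",
    "Store_name_value", "Store_name_key",
    "Store_addr_value", "Store_addr_key",
    "Tel_value", "Tel_key",
    "Date_value", "Date_key",
    "Time_value", "Time_key",
    "Prod_item_value", "Prod_item_key",
    "Prod_quantity_value", "Prod_quantity_key",
    "Prod_price_value", "Prod_price_key",
    "Subtotal_value", "Subtotal_key",
    "Tax_value", "Tax_key",
    "Tips_value", "Tips_key",
    "Total_value", "Total_key",
    "Others",
]

# Assumption: the id of a label is its index in the class list.
LABEL_TO_ID = {name: index for index, name in enumerate(CLASS_LIST)}


def encode_labels(label_names):
    """Map label names to ids; an unknown name raises KeyError."""
    return [LABEL_TO_ID[name] for name in label_names]


ids = encode_labels(["Store_name_value", "Date_key", "Total_value"])
```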

Downloading the data

Paddle has a built-in download helper, so we can use it directly to fetch the file.

import paddle
import paddle.dataset.common


class WildReceiptDataset(paddle.io.Dataset):
    NAME = "wildreceipt"
    DATASET_URL = "https://paddleocr.bj.bcebos.com/ppstructure/dataset/wildreceipt.tar"
    DATASET_MD5 = "0b9abbc025e85515247f8a464c7b44dc"

    def __init__(self, path=None, mode="train", transform=None, download=True):
        super().__init__()

        assert mode.lower() in [
            "train",
            "test",
        ], f"mode should be 'train' or 'test', but got {mode}"

        self.mode = mode.lower()
        self.path = path
        if self.path is None:
            # No local archive was given, so fetch it (or reuse the cached copy).
            assert (
                download
            ), "path is not set and downloading automatically is disabled"
            self.path = paddle.dataset.common.download(
                self.DATASET_URL, self.NAME, self.DATASET_MD5
            )

        self.transform = transform
        # Build the index of image paths and labels (see _parse_dataset).
        self._parse_dataset()

The downloaded archive is cached automatically, so it is not downloaded again on later runs.
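The role of the MD5 value can be illustrated with a small checksum-based cache check. This is not Paddle's implementation, only a sketch of the general idea that a file already on disk with a matching checksum does not need to be fetched again; `file_md5` and `is_cached` are hypothetical helpers, and the demo file is a stand-in for the real archive.

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_cached(path, expected_md5):
    """True when the file exists and its checksum matches."""
    return os.path.exists(path) and file_md5(path) == expected_md5


# Demo with a small stand-in file instead of the real archive.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.tar")
    with open(demo_path, "wb") as f:
        f.write(b"hello")
    demo_md5 = file_md5(demo_path)
    cached = is_cached(demo_path, demo_md5)
    missing = is_cached(os.path.join(tmp, "absent.tar"), demo_md5)
```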

Parsing the data

We start by simply collecting the image paths and the label information; the label is returned as the raw JSON string.

import tarfile


class WildReceiptDataset(paddle.io.Dataset):
    def _parse_dataset(self):
        # Index every sample: image path inside the archive plus its raw JSON label.
        self.images = []
        self.labels = []

        main_file = "wildreceipt/wildreceipt_" + self.mode + ".txt"

        with tarfile.open(self.path) as tar_file:
            f = tar_file.extractfile(main_file)
            if f is None:
                # extractfile returns None for members that are not regular files.
                raise ValueError(f"{main_file} is not a regular file in {self.path}")
            text = f.read().decode("utf-8")

        for line in text.splitlines():
            if not line:
                continue
            # Each line is "<image path>\t<JSON annotations>".
            file_name, label = line.split("\t", 1)
            self.images.append(file_name)
            self.labels.append(label)
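The same read-from-archive pattern can be tried without downloading anything by building a tiny tar file in memory. The archive layout below mirrors the member name used above, but its contents are made-up placeholder records; `build_demo_tar` and `read_annotations` are hypothetical helpers for illustration.

```python
import io
import tarfile


def build_demo_tar():
    """Create a small in-memory tar with one annotation file."""
    payload = (
        'image_files/a.jpeg\t[{"label": "Others", "transcription": "x", "points": []}]\n'
        "image_files/b.jpeg\t[]\n"
    ).encode("utf-8")
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        info = tarfile.TarInfo("wildreceipt/wildreceipt_train.txt")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


def read_annotations(fileobj, member_name):
    """Return (image paths, raw JSON labels) from an annotation member."""
    with tarfile.open(fileobj=fileobj) as tar:
        extracted = tar.extractfile(member_name)  # KeyError if the member is absent
        if extracted is None:
            raise ValueError(f"{member_name} is not a regular file")
        text = extracted.read().decode("utf-8")

    images, labels = [], []
    for line in text.splitlines():
        if not line:
            continue
        file_name, label = line.split("\t", 1)
        images.append(file_name)
        labels.append(label)
    return images, labels


images, labels = read_annotations(
    build_demo_tar(), "wildreceipt/wildreceipt_train.txt"
)
```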

Then the image is decoded when a sample is fetched:

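import io
import tarfile

from PIL import Image


class WildReceiptDataset(paddle.io.Dataset):
    def __getitem__(self, idx):
        file_name = self.images[idx]
        # Re-open the archive and read the image bytes into memory.
        with tarfile.open(self.path) as tar_file:
            image_data = io.BytesIO(
                tar_file.extractfile("wildreceipt/" + file_name).read()
            )
        image = Image.open(image_data)
        if self.transform is not None:
            image = self.transform(image)
        return image, self.labels[idx]

    def __len__(self):
        return len(self.images)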

