Non-ascii unicode field names that encode to more than 10 bytes can be corrupted by truncation at a byte offset instead of a code point boundary (and are then no longer decodable as UTF-8) #416

Description

@JamesParrott

PyShp Version

3.latest, and at least as far back as 2.3.1

Python Version

3.14

Your code

import shapefile as shp

print(f"{shp.__version__=}")

with shp.Writer("delete_me") as w:
    w.field('ÀÀÀÀ०')
    print(f"{w.fields=}")
with shp.Reader("delete_me") as r:
    pass

Full stacktrace

>python field_name_bug.py
shp.__version__='2.3.1'
w.fields=[('ÀÀÀÀ०', 'C', '50', 0)] # name encodes to 11 bytes, the final char requiring 3
Traceback (most recent call last):
  File "C:\...\field_name_bug.py", line 8, in <module>
    with shp.Reader("delete_me") as r:
         ~~~~~~~~~~^^^^^^^^^^^^^
  File "C:\...\shapefile.py", line 1072, in __init__
    self.load(path)
    ~~~~~~~~~^^^^^^
  File "C:\...\shapefile.py", line 1221, in load
    self.__dbfHeader()
    ~~~~~~~~~~~~~~~~^^
  File "C:\...\shapefile.py", line 1550, in __dbfHeader
    fieldDesc[name] = u(fieldDesc[name], self.encoding, self.encodingErrors)
                      ~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\...\shapefile.py", line 128, in u
    return v.decode(encoding, encodingErrors)
           ~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 8-9: unexpected end of data
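
The same failure can be reproduced without PyShp. This is only a minimal sketch, assuming the field name is encoded as UTF-8 and then cut at a 10-byte limit (`FIELD_NAME_LIMIT` is a name made up for this example, not a PyShp constant). The name is written with escapes so the code points are unambiguous.

```python
FIELD_NAME_LIMIT = 10  # assumed byte limit for a field name (hypothetical constant)

# 'ÀÀÀÀ०': four U+00C0 (2 bytes each in UTF-8) followed by U+0966 (3 bytes)
name = "\u00c0\u00c0\u00c0\u00c0\u0966"
encoded = name.encode("utf-8")
byte_lengths = [len(ch.encode("utf-8")) for ch in name]

# Slicing the bytes keeps only 2 of the 3 bytes of the final character
truncated = encoded[:FIELD_NAME_LIMIT]

try:
    truncated.decode("utf-8")
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc

print(byte_lengths, len(encoded))
print(decode_error)
```

The incomplete sequence starts at byte 8, which agrees with the "position 8-9" in the traceback.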

Other notes

Do ArcGIS, QGIS or anything else support reading or allow creating shapefiles with non-ascii unicode in the field names?
Yes - Micah provided a link.

Does this break anything for our users if we forbid non-ascii unicode? Certainly.

Or does forbidding it help them avoid other breakages elsewhere (i.e. is it non-compliant/broken already, so should we break their code)? Heck no.

I will fix the truncation to be code point aware (and perhaps warn either way if the name is non-ascii).
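
One possible shape for a code point aware truncation, as a rough sketch only. `truncate_to_bytes` is a hypothetical helper, not an existing PyShp function, and it assumes a BOM-free encoding such as UTF-8 (encoding character by character would repeat the BOM for something like UTF-16).

```python
def truncate_to_bytes(text: str, max_bytes: int = 10, encoding: str = "utf-8") -> bytes:
    """Encode text, keeping only whole characters that fit in max_bytes."""
    out = bytearray()
    for ch in text:
        ch_bytes = ch.encode(encoding)
        # Stop before the first character that would overflow the limit,
        # so a multi-byte character is never split.
        if len(out) + len(ch_bytes) > max_bytes:
            break
        out += ch_bytes
    return bytes(out)


# 'ÀÀÀÀ०' encodes to 11 bytes; the final 3-byte character is dropped whole.
name = "\u00c0\u00c0\u00c0\u00c0\u0966"
safe = truncate_to_bytes(name)
print(len(safe), safe.decode("utf-8"))
```

With this approach the stored name is shorter than 10 bytes when the last character does not fit, but it always decodes cleanly. It operates on code points, so a combining sequence could still be separated from its base character.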

These DBF specs only mention ascii field names:
https://en.wikipedia.org/wiki/.dbf#Field_descriptor_array

https://dbase.com/Knowledgebase/int/db7_file_fmt.htm

See above, and other issues. ArcGIS supports unicode, and many users want to store unicode in Shapefiles.
