Commit graph

106 commits

Author SHA1 Message Date
Scott Shawcroft
7f0cc9e7b4
Add Zephyr port
This port is meant to grow to encompass all existing boards. For
now, it is a separate port while we transition over.

It is named `zephyr-cp` to differentiate it from the MicroPython
`zephyr` port. They are separate implementations.
2025-02-04 11:24:13 -08:00
Dan Halbert
ac7e15f88a (only) resolve merge conflicts 2024-08-28 16:31:37 -04:00
Dan Halbert
be6fa2af21 merge from main 2024-07-29 17:41:46 -04:00
Dan Halbert
71f17b08fb wip: fixing compilation 2024-07-26 18:38:46 -04:00
Dan Halbert
69b667406b MPy v1.22 merge: initial merge; not compiled yet 2024-07-25 15:16:24 -04:00
Jim Mussared
d694ac6e1b py/makeqstrdata.py: Ensure that scope names get low qstr values.
Originally implemented in a patch file provided by @ironss-iotec.

Fixes issue #14093.

Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
2024-03-26 22:52:25 +11:00
Jim Mussared
7ea503929a py/qstr: Add support for MICROPY_QSTR_BYTES_IN_HASH=0.
This disables using qstr hashes altogether, which saves RAM and flash
(two bytes per interned string on a typical build) as well as code size.
On PYBV11 this is worth over 3k flash.

qstr comparison will now be done just by length then data. This affects
qstr_find_strn, although this has a negligible performance impact as, for a
given comparison, the length and first character will ~usually be
different anyway.

String hashing (e.g. builtin `hash()` and map.c) now needs to compute the
hash dynamically, and for the map case this does come at a performance
cost.

This work was funded through GitHub Sponsors.

Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
2024-01-25 16:38:17 +11:00
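A minimal Python sketch of the lookup order this commit describes (illustrative only; `qstr_find_strn` and `compute_hash` here stand in for the real C code):

```
# Illustrative sketch, not MicroPython's C implementation.

def compute_hash(data):
    # Stand-in for the qstr hash that hash()/map.c now compute on demand.
    h = 5381
    for b in data:
        h = ((h * 33) ^ b) & 0xFFFF
    return h

def qstr_find_strn(pool, s):
    # With MICROPY_QSTR_BYTES_IN_HASH=0 there is no stored hash to check:
    # compare length, then data; the first byte usually differs anyway.
    for index, entry in enumerate(pool):
        if len(entry) == len(s) and entry == s:
            return index
    return None

pool = [b"update", b"append", b"extend"]
assert qstr_find_strn(pool, b"append") == 1
```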
51e3d5ecbf
Add a note about this makeqstrdata change 2023-12-24 10:41:54 -06:00
40722741dd
makeqstrdata: ensure certain qstrs are as early as possible
Some qstrs like those representing binary ops such as __add__ must have
qstr numbers that fit in 8 bits. Replace the former ad-hoc method,
which sorted *other* dunder identifiers early, with a list of all
the qstrs that have this requirement.

Before this, the unix coverage build was failing when I added certain
qstrs like "<input>" for a codeop filename default value.
2023-12-14 17:19:26 -06:00
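A hedged sketch of the mechanism: qstrs whose numbers must stay below 256 are emitted ahead of the normal ordering. The list below is abbreviated and illustrative, not the actual list in makeqstrdata.py:

```
# Abbreviated, illustrative version of the "emit these first" idea.
MUST_BE_EARLY = ["__add__", "__sub__", "__lt__", "__gt__", "<input>"]

def order_qstrs(qstrs):
    early = [q for q in MUST_BE_EARLY if q in qstrs]
    rest = sorted(q for q in qstrs if q not in MUST_BE_EARLY)
    return early + rest   # the early group gets the low (8-bit) numbers
```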
Jim Mussared
64c79a5423 py/qstr: Add support for sorted qstr pools.
This provides a significant performance boost for qstr_find_strn, which is
called a lot during parsing and loading of .mpy files, as well as interning
of string objects (which happens in most string methods that return new
strings).

Also adds comments to explain the "static" qstrs.  These are part of the
.mpy ABI and avoid needing to duplicate string data for QSTRs known to
already be in the firmware.  The static pool isn't currently sorted, but in
the future we could either split the static pool into the sorted regions,
or in the next .mpy version just sort them.

Based on initial work done by @amirgon in #6896.

This work was funded through GitHub Sponsors.

Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
2023-10-30 11:10:02 +11:00
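A sketch of why sorting pays off: lookup over a sorted pool can binary-search instead of scanning linearly. Pure illustration in Python; the real code is C over flash-resident pools:

```
import bisect

def find_in_sorted_pool(pool, s):
    # The pool is sorted, so lookup is O(log n) rather than a linear scan.
    i = bisect.bisect_left(pool, s)
    if i < len(pool) and pool[i] == s:
        return i
    return None

pool = sorted([b"append", b"extend", b"insert", b"update"])
assert find_in_sorted_pool(pool, b"insert") == 2
```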
Dan Halbert
367e13c69f change CIRCUITPY change markers to CIRCUITPY-CHANGE 2023-10-19 16:42:36 -04:00
Dan Halbert
f2ebe6839c Initial MicroPython v1.21.0 merge; not compiled yet 2023-10-18 17:49:14 -04:00
9104654930
makeqstrdata: ensure _lt and _gt qstrs are sorted early
This fixes a build error because their numbers have to be < 256.
2023-09-22 10:38:52 -05:00
Dan Halbert
d582407b06 pre-commit fixes 2023-08-14 00:59:22 -04:00
Dan Halbert
48e404ab90 wip; fix frozen; still need to fix -j1 for frozen 2023-08-12 15:44:27 -04:00
Dan Halbert
8a89a3d425 sort TRANSLATION()'s 2023-08-11 13:36:03 -04:00
Dan Halbert
fe0e2f13bc wip; fix qstr processing 2023-08-10 20:06:32 -04:00
Jim Mussared
8b27482692 top: Update Python formatting to black "2023 stable style".
See https://black.readthedocs.io/en/stable/the_black_code_style/index.html

Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
2023-02-02 12:51:03 +11:00
MicroDev
d9d94eacca
run updated pre-commit 2023-02-01 13:38:41 +05:30
Scott Shawcroft
fd5ef009a4
Move compressed strings into own object file
This removes the dependency of all the other objects on the translation
and therefore speeds up subsequent builds. Now, even when the big
translate() function is inlined in the header, it only needs to be
optimized once.
2022-06-02 11:48:56 -07:00
Scott Shawcroft
9d10a3da66
Conditionalize LTO 2022-05-27 12:59:54 -07:00
Artyom Skrobov
18b1ba086c py/qstr: Separate hash and len from string data.
This allows the compiler to merge strings: e.g. "update",
"difference_update" and "symmetric_difference_update" will all point to the
same memory.

No functional change.

The size reduction depends on the number of qstrs in the build.  The change
this commit brings is:

   bare-arm:    -4 -0.007%
minimal x86:  +150 +0.092% [incl +48(data)]
   unix x64:  -608 -0.118%
unix nanbox:  -572 -0.126% [incl +32(data)]
      stm32: -1392 -0.352% PYBV10
     cc3200:  -448 -0.244%
    esp8266: -1208 -0.173% GENERIC
      esp32: -1028 -0.068% GENERIC [incl -1020(data)]
        nrf:  -440 -0.252% pca10040
        rp2: -1072 -0.217% PICO
       samd:  -368 -0.264% ADAFRUIT_ITSYBITSY_M4_EXPRESS

Performance is also improved (on bare metal at least) for the
core_import_mpy_multi.py, core_import_mpy_single.py and core_qstr.py
performance benchmarks.

Originally at adafruit#4583

Signed-off-by: Artyom Skrobov <tyomitch@gmail.com>
2022-02-11 22:52:32 +11:00
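A small sketch of the layout change, with illustrative names: once hash and length live in side tables, the data pointers are plain C strings, and the toolchain can overlay one string onto the tail of another:

```
# Illustrative: separate hash/len tables let data pointers alias suffixes.
words = ["update", "difference_update", "symmetric_difference_update"]
blob = "symmetric_difference_update"    # one buffer backs all three

attrs = []
for w in words:
    off = blob.index(w)                 # "update" aliases the shared tail
    attrs.append((hash(w) & 0xFF, len(w), off))

for (h, n, off), w in zip(attrs, words):
    assert blob[off:off + n] == w
```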
Jeff Epler
d59a28db97 Compress word offset table
By storing "count of words by length", the long `wends` table can be
replaced with a short `wlencount` table.  This saves flash storage space.

Extend the range of string lengths that can be in the dictionary.
Originally it was 2 to 9; at one point it was changed to 3 to 9.
Putting the lower bound back at 2 has a positive impact on the French
translation (a bunch of them, such as "ch", "\r\n", "%q", are used).
Increasing the maximum length gets 'mpossible', ' doit être ',
and 'CircuitPyth' at the long end.  This adds a bit of processing time
to makeqstrdata. The specific 2/11 values are again empirical based on
the French translation on the adafruit_proxlight_trinkey_m0.
2021-08-07 09:23:35 -05:00
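A hedged sketch of the replacement, assuming (as the dictionary format does) that words are stored sorted by length: per-length counts are enough to recover any word's offset, so the long per-word table can go:

```
# Illustrative reconstruction of a word offset from length counts alone.
words = sorted(["%q", "ch", "err", "not", "input"], key=len)
MIN_LEN, MAX_LEN = 2, 11
wlencount = [sum(1 for w in words if len(w) == n)
             for n in range(MIN_LEN, MAX_LEN + 1)]

def word_offset(index):
    offset, i = 0, 0
    for n, count in enumerate(wlencount, start=MIN_LEN):
        for _ in range(count):
            if i == index:
                return offset
            offset += n
            i += 1
    raise IndexError(index)

assert word_offset(3) == 2 + 2 + 3    # skips "%q", "ch", "err"
```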
Jeff Epler
0b8b16f6ac increase accuracy of the comment on the net savings estimate function
Thanks to tyomitch for suggesting the comment could be more accurate.
2021-07-11 08:57:27 -05:00
52e75c645d makeqstrdata: Don't include strings that are a net loss! 2021-07-09 14:26:43 -05:00
8836198ff1 TextSplitter: don't mutate 'words'
I was puzzled by why the dictionary words were sorted by length.
It was because TextSplitter sorted its parameter, instead of a copy.

This doesn't affect encoding size, but does affect the encoding NUMBER
of the found words.  We'll deliberately restore sorting by length next,
for other reasons, but not by spooky action.
2021-07-09 14:02:31 -05:00
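The fix in miniature (illustrative, not the script's actual class): sort a copy, so the caller's list, and with it the words' encoding numbers, stays put:

```
class TextSplitter:
    def __init__(self, words):
        self.words = sorted(words, key=len)   # copy; was words.sort(key=len)

corpus_words = ["banana", "an", "na"]
TextSplitter(corpus_words)
assert corpus_words == ["banana", "an", "na"]   # caller's order untouched
```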
99abd03b7a makeqstrdata: use an extremely accurate dictionary heuristic
Try to accurately measure the costs of including a word in the dictionary
vs the gains from using it in messages.

This saves about 160 bytes on trinket_m0 ja, the fullest translation
for that board.  Other translations on the same board all have savings,
ranging from 24 to 228 bytes.

```
Translation     Before  After   Savings   (columns are bytes free in flash)
ja              1164    1324    160
de_DE           1260    1396    136
fr              1424    1652    228
zh_Latn_pinyin  1448    1520    72
pt_BR           1584    1736    152
pl              1592    1640    48
es              1724    1816    92
ko              1724    1816    92
fil             1764    1800    36
it_IT           1896    2040    144
nl              1956    2136    180
ID              2072    2180    108
cs              2124    2148    24
sv              2340    2448    108
en_x_pirate     2644    2740    96
en_GB           2652    2752    100
el              2656    2768    112
en_US           2656    2768    112
hi              2656    2768    112
```
2021-07-09 12:45:49 -05:00
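The shape of the heuristic, as a hedged sketch: a word earns its dictionary slot only when the bits saved across all its uses exceed the bits spent storing it. The flat bit costs below are stand-ins, not the script's Huffman-aware accounting:

```
def estimated_net_savings(word, uses, char_bits=7, code_bits=9):
    cost = len(word) * char_bits              # storing the word once
    saved_per_use = len(word) * char_bits - code_bits
    return uses * saved_per_use - cost

assert estimated_net_savings("cannot", uses=40) > 0   # frequent word: keep
assert estimated_net_savings("xz", uses=1) < 0        # rare short word: net loss
```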
45dc0953a5 makeqstrdata.py: Remove a problematic print
.. it contained non-ASCII characters, even when building the standard
English translation.

This may help resolve the build problems reported at #4750.
2021-05-11 21:48:21 -05:00
Scott Shawcroft
b35fa44c8a
Merge MicroPython 1.12 into CircuitPython 2021-05-03 14:01:18 -07:00
Jeff Epler
dfa7c3d32d codeformat: Fix handling of **
After discussing with danh, I noticed that `a/**/b` would not match `a/b`.

After correcting this and re-running "pre-commit run --all", additional
files were reindented, including the codeformat script itself.
2021-04-30 15:30:13 -05:00
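A sketch of the corner case: the translation of `**` must allow zero intermediate directories, so the whole "directories plus slash" group becomes optional. The regex below is illustrative, not the script's actual translation:

```
import re

# `a/**/b` must also match `a/b`, so `/**/` maps to an optional group.
pattern = re.compile(r"^a/(?:.*/)?b$")
assert pattern.match("a/b")
assert pattern.match("a/x/y/b")
assert not pattern.match("ab")
```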
Scott Shawcroft
76033d5115
Merge MicroPython v1.11 into CircuitPython 2021-04-26 15:47:41 -07:00
Scott Shawcroft
e54e5e3575
Merge pull request #4564 from tyomitch/patch-1
[build] simplify makeqstrdata heuristic
2021-04-19 14:50:42 -07:00
Artyom Skrobov
dcee89ade7 build: simplify compute_huffman_coding()
No functional change.
2021-04-09 08:36:26 -04:00
Artyom Skrobov
68920682b6 [build] simplify makeqstrdata heuristic
The simpler one saves, on average, 51 more bytes per translation;
the biggest translation per board is reduced, on average, by 85 bytes.
2021-04-09 07:18:40 -04:00
Artyom Skrobov
c3e40d50ab [qstr] Separate hash and len from string data
This allows the compiler to merge strings: e.g. "update",
"difference_update" and "symmetric_difference_update"
will all point to the same memory.

Shaves ~1KB off the image size, and potentially allows
bigger savings if qstr attrs are initialized in qstr_init(),
and not stored in the image.
2021-04-06 12:58:42 -04:00
microDev
a52eb88031
run code formatting script 2021-03-15 19:27:36 +05:30
0318eb359f makeqstrdata: Work around python3.6 compatibility problem
Discord user Folknology encountered a problem building with Python 3.6.9,
`TypeError: ord() expected a character, but string of length 0 found`.

I was able to reproduce the problem using Python 3.5*, and discovered that
the meaning of the regular expression `"|."` had changed in 3.7.  Before,
```
>>> [m.group(0) for m in re.finditer("|.", "hello")]
['', '', '', '', '', '']
```
After:
```
>>> [m.group(0) for m in re.finditer("|.", "hello")]
['', 'h', '', 'e', '', 'l', '', 'l', '', 'o', '']
```
Check if `words` is empty and if so use `"."` as the regular expression
instead.  This gives the same result on both versions:
```
['h', 'e', 'l', 'l', 'o']
```
and fixes the generation of the huffman dictionary.

Folknology verified that this fix worked for them.

 * I could easily install 3.5 but not 3.6.  3.5 reproduced the same problem
2020-09-21 10:03:07 -05:00
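A sketch of the workaround (the function shape is illustrative): when `words` is empty, the alternation would start with `|`, which pre-3.7 Python matches differently, so fall back to `"."`:

```
import re

def tokenize(text, words):
    # "x|y|." with words, plain "." without: same result on 3.6 and 3.7+.
    pattern = "|".join(re.escape(w) for w in words) + "|." if words else "."
    return [m.group(0) for m in re.finditer(pattern, text)]

assert tokenize("hello", []) == ["h", "e", "l", "l", "o"]
assert tokenize("hello", ["ll"]) == ["h", "e", "ll", "o"]
```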
bfbbbd6c5c makeqstrdata: Work with older Python
This construct (which I added without sufficient testing,
apparently) is only supported in Python 3.7 and newer.  Make it
optional so that this script works on other Python versions.  This
means that if you have a system with non-UTF-8 encoding you will
need to use Python 3.7.

In particular, this affects a problem building circuitpython in
github's ubuntu-18.04 virtual environment when Python 3.7 is not
explicitly installed.  cookie-cuttered libraries call for Python
3.6:
```
    - name: Set up Python 3.6
      uses: actions/setup-python@v1
      with:
        python-version: 3.6
```
Since CircuitPython's own build calls for 3.8, this problem was not
detected.

This problem was also encountered by discord user mdroberts1243.

The failure I encountered was here:
https://github.com/jepler/Jepler_CircuitPython_udecimal/runs/1138045020?check_suite_focus=true
.. while my step of "clone and build circuitpython unix port" is
unusual, I think the same problem would have affected "build assets"
if that step had been reached.
2020-09-19 10:16:13 -05:00
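A sketch of the compatible form, combining this fix with the encoding setup from the commit below (treat the details as illustrative):

```
import sys

# sys.stdout.reconfigure() exists only on Python 3.7+, so guard it
# instead of requiring the newer interpreter.
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8")
    sys.stderr.reconfigure(errors="backslashreplace")
```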
a8e98cda83 makeqstrdata: comment my understanding of @ciscorn's code 2020-09-16 08:28:15 -05:00
Taku Fukada
d18d79ac47 Small improvements to the dictionary compression 2020-09-14 01:50:01 +09:00
15964a4750 makeqstrdata: Avoid encoding problems
Most users and the CI system are running in configurations where Python
configures stdout and stderr in UTF-8 mode.  However, Windows is different,
setting values like CP1252.  This led to a build failure on Windows, because
makeqstrdata printed Unicode strings to its stdout, expecting them to be
encoded as UTF-8.

This script is writing (stdout) to a compiler input file and potentially
printing messages (stderr) to a log or console.  Explicitly configure stdout to
use utf-8 to get consistent behavior on all platforms, and configure stderr so
that if any log/diagnostic messages are printed that cannot be displayed
correctly, they are still displayed instead of creating an error while trying
to print the diagnostic information.

I considered setting the encodings both to ascii, but this would just be
occasionally inconvenient to developers like me who want to show diagnostic
info on stderr and in comments while working with the compression code.

Closes: #3408
2020-09-12 19:43:08 -05:00
40ab5c6b21 compression: Implement ciscorn's dictionary approach
Massive savings.  Thanks so much @ciscorn for providing the initial
code for choosing the dictionary.

This adds a bit of time to the build, both to find the dictionary
but also because (for reasons I don't fully understand), the binary
search in the compress() function no longer worked and had to be
replaced with a linear search.

I think this is because the intended invariant is that for codebook
entries that encode to the same number of bits, the entries are ordered
in ascending value.  However, I mis-placed the transition from "words"
to "byte/char values" so the codebook entries for words are in word-order
rather than their code order.

Because this price is only paid at build time, I didn't care to determine
exactly where the correct fix was.

I also commented out a line to produce the "estimated total memory size"
-- at least on the unix build with TRANSLATION=ja, this led to a build
time KeyError trying to compute the codebook size for all the strings.
I think this occurs because some single unicode code point ('ァ') is
no longer present as itself in the compressed strings, due to always
being replaced by a word.

As promised, this seems to save hundreds of bytes in the German translation
on the trinket m0.

Testing performed:
 - built trinket_m0 in several languages
 - built and ran unix port in several languages (en, de_DE, ja) and ran
   simple error-producing codes like ./micropython -c '1/0'
2020-09-12 10:10:45 -05:00
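A hedged sketch of the substitution stage this dictionary feeds (names and structure are illustrative): replace dictionary words greedily, longest first, then Huffman-code the resulting symbol stream:

```
def substitute_words(text, words):
    # Longest-first greedy substitution; each hit becomes one symbol.
    ordered = sorted(range(len(words)), key=lambda k: -len(words[k]))
    out, i = [], 0
    while i < len(text):
        for k in ordered:
            if text.startswith(words[k], i):
                out.append(("word", k))
                i += len(words[k])
                break
        else:
            out.append(("char", text[i]))
            i += 1
    return out

syms = substitute_words("index out of range", ["index", " out of "])
assert syms[0] == ("word", 0) and syms[1] == ("word", 1)
```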
bdb07adfcc translations: Make decompression clearer
Now this gets filled in with values e.g., 128 (0x80) and 159 (0x9f).
2020-09-08 19:07:53 -05:00
cbfd38d1ce Rename functions to encode_ngrams / decode_ngrams 2020-09-02 19:09:23 -05:00
c34cb82ecb makeqstrdata: correct range of low code points to 0x80..0x9f inclusive
The previous range was unintentionally big and overlaps some characters
we'd like to use (and also 0xa0, which we don't intentionally use)
2020-09-02 15:52:02 -05:00
Jeff Epler
07740d19f3 add bigram compression to makeqstrdata
Compress common unicode bigrams by making code points in the range
0x80 - 0xbf (inclusive) represent them.  Then, they can be greedily
encoded and the substituted code points handled by the existing Huffman
compression.  Normally code points in the range 0x80-0xbf are not used
in Unicode, so we stake our own claim.  Using the more arguably correct
"Private Use Area" (PUA) would mean that for scripts that only use
code points under 256 we would use more memory for the "values" table.

bigram means "two letters", and is also sometimes called a "digram".
It's nothing to do with "big RAM".  For our purposes, a bigram represents
two successive unicode code points, so for instance in our build on
trinket m0 for english the most frequent are:
['t ', 'e ', 'in', 'd ', ...].

The bigrams are selected based on frequency in the corpus, but the
selection is not necessarily optimal, for these reasons I can think of:
 * Suppose the corpus was just "tea" repeated 100 times.  The
   top bigrams would be "te" and "ea".  However, due to
   overlap, "ea" could never be used.  Thus, some bigrams might actually
   waste space
    * I _assume_ this has to be why e.g., bigram 0x86 "s " is more
      frequent than bigram 0x85 " a" in English for Trinket M0, because
      sequences like "can't add" would get the "t " digram and then
      be unable to use the " a" digram.

 * And generally, if a bigram is frequent then so are its constituents.
   Say that "i" and "n" both encode to just 5 or 6 bits, then the huffman
   code for "in" had better compress to 10 or fewer bits or it's a net
   loss!
    * I checked though!  "i" is 5 bits, "n" is 6 bits (lucky guess)
      but the bigram 0x83 is also just 6 bits, so this one is a win of
      5 bits for every "in" minus overhead.  Yay, this round goes to team
      compression.
    * On the other hand, the least frequent bigram 0x9d " n" is 10 bits
      long and its constituent code points are 4+6 bits so there's no
      savings, but there is the cost of the table entry.
    * and somehow 0x9f 'an' is never used at all!

With or without accounting for overlaps, there is some optimum number
of bigrams.  Adding one more bigram uses at least 2 bytes (for the
entry in the bigram table; 4 bytes if code points >255 are in the
source text) and also needs a slot in the Huffman dictionary, so
adding bigrams beyond the optimum number makes compression worse again.

If it's an improvement, the fact that it's not guaranteed optimal
doesn't seem to matter too much.  It just leaves a little more fruit
for the next sweep to pick up.  Perhaps try adding the most frequent
bigram not yet present, until it doesn't improve compression overall.

Right now, de_DE is again the "fullest" build on trinket_m0.  (It's
reclaimed that spot from the ja translation somehow)  This change saves
104 bytes there, increasing free space about 6.8%.  In the larger
(but not critically full) pyportal build it saves 324 bytes.

The specific number of bigrams used (32) was chosen as it is the max
number that fit within the 0x80..0xbf range.  Larger tables would
require the use of 16 bit code points in the de_DE build, losing savings
overall.

(Side note: The most frequent letters in English have been said
to be: ETA OIN SHRDLU; but we have UAC EIL MOPRST in our corpus)
2020-09-01 17:12:22 -05:00
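A compact sketch of the pass (illustrative; the range was later narrowed to 0x80..0x9f by c34cb82ecb above, which 32 entries exactly fill): count adjacent pairs, keep the most frequent, substitute greedily before Huffman coding:

```
from collections import Counter

def top_bigrams(corpus, n=32):
    counts = Counter(s[i:i + 2] for s in corpus for i in range(len(s) - 1))
    return [bg for bg, _ in counts.most_common(n)]

def substitute_bigrams(text, bigrams):
    out, i = "", 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in bigrams:
            out += chr(0x80 + bigrams.index(pair))   # claim an unused code point
            i += 2
        else:
            out += text[i]
            i += 1
    return out

bigrams = top_bigrams(["in index", "inner"])
assert len(substitute_bigrams("in index", bigrams)) < len("in index")
```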
Taku Fukada
79a3796b1c Calculate the Huffman codebook without MP_QSTRs 2020-08-18 23:21:14 +09:00
08ed09acc6 makeqstrdata: don't print "compression increased length" messages
This check as implemented is misleading, because it compares the
compressed size in bytes (including the length indication) with the source
string length in Unicode code points.  For English this is approximately
fair, but for Japanese this is quite unfair and produces an excess of
"increased length" messages.

This message might have existed for one of two reasons:
 * to alert to an improperly functioning huffman compression
 * to call attention to a need for a "string is stored uncompressed" case
We know by now that the huffman compression is functioning as designed and
effective in general.

Just to be on the safe side, I did some back-of-the-envelope estimates.
I considered these three replacements for "the true source string size, in bytes":
+    decompressed_len_utf8 = len(decompressed.encode('utf-8'))
+    decompressed_len_utf16 = len(decompressed.encode('utf-16be'))
+    decompressed_len_bitsize = ((1+len(decompressed)) * math.ceil(math.log(1+len(values), 2)) + 7) // 8

The third counts how many bits each character requires (fewer than 128
characters in the source character set = 7, fewer than 256 = 8, fewer than 512
= 9, etc, adding a string-terminating value) and is in some way representative
of the best way we would be able to store "uncompressed strings".  The Japanese
translation (largest as of writing) has just a few strings which increase by
this metric.  However, the amount of loss due to expansion in those cases is
outweighed by the cost of adding 1 bit per string to indicate whether it's
compressed or not.  For instance, in the BOARD=trinket_m0 TRANSLATION=ja build
the loss is 47 bytes over 300 strings.  Adding 1 bit to each of 300 strings will
cost about 37 bytes, leaving just 5 Thumb instructions to implement the code to
check and decode "uncompressed" strings in order to break even.
2020-08-16 20:50:48 -05:00
d0f9b5901e translations: document the compressed format 2020-05-28 11:30:46 -05:00
fe3e8d1589 string compression: save a few bits per string
Length was stored as a 16-bit number always.  Most translations have
a max length far less.  For example, US English translation lengths
always fit in just 8 bits.  Probably all languages fit in 9 bits.

This also has the side effect of reducing the alignment of
compressed_string_t from 2 bytes to 1.

Testing performed: ran in German and English on pyruler; printed messages
looked right.

Firmware size, en_US
Before: 3044 bytes free in flash
After: 3408 bytes free in flash

Firmware size, de_DE (with #2967 merged to restore translations)
Before: 1236 bytes free in flash
After: 1600 bytes free in flash
2020-05-28 08:36:08 -05:00
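A sketch of the sizing idea, with illustrative numbers: derive the width of the length field from the longest message instead of always spending 16 bits:

```
max_length = 150                        # e.g. longest en_US message (illustrative)
LENGTH_BITS = max_length.bit_length()   # 8 here; the commit expects <= 9 anywhere

def pack_length(n):
    assert 0 <= n <= max_length
    return format(n, "0{}b".format(LENGTH_BITS))

assert len(pack_length(150)) == 8
```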