I’m working on a JavaScript app in Node.js v22.11.0 with @octokit/rest 21.0.2 and crypto-js 4.2.0, and I’m running into an issue where the text content of a file with accented characters (e.g., in Spanish) gets corrupted when pushed and pulled via the GitHub API. Specifically, if I push a file containing the string "Quedar con la tía María" (“meet up with aunt María” in Spanish, if you are curious) and then pull it back, I end up with incorrect characters in the decoded output. Depending on the decoding method used, I get one of these results:
"Quedar con la t�a Mar�a"
"Quedar con la tÃa MarÃa"
This issue occurs whether I use Buffer.from(repoFile.base64content, 'base64').toString('utf-8') or atob(repoFile.base64content).
Additionally, the SHA hash calculated for the file after decoding is different from the original GitHub SHA. The SHA calculation works fine when there are no accented characters.
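For reference, Git computes a blob's SHA-1 over the header `blob <size in bytes>` plus a NUL byte, followed by the raw file bytes, so counting characters instead of bytes can change the hash as soon as the text contains multi-byte characters. The following is a minimal sketch of that calculation, assuming Node's built-in `node:crypto` module rather than crypto-js. The names `gitBlobSha` and `sample` are hypothetical and do not come from my code. It uses `require` so it runs as a standalone script; in an ES module the equivalent is `import { createHash } from 'node:crypto'`.

```javascript
const { createHash } = require('node:crypto');

// Hypothetical helper (an assumption for illustration): hashes content the way
// `git hash-object` does, i.e. SHA-1 over "blob <byte length>\0" + raw bytes.
function gitBlobSha(content) {
  const body = Buffer.from(content, 'utf8');                  // UTF-8 bytes, not UTF-16 code units
  const header = Buffer.from(`blob ${body.length}\0`, 'utf8'); // size in bytes, NUL-terminated
  return createHash('sha1').update(header).update(body).digest('hex');
}

const sample = 'Quedar con la tía María';
console.log(sample.length);                     // 23 (characters)
console.log(Buffer.byteLength(sample, 'utf8')); // 25 (bytes, each "í" takes two)
console.log(gitBlobSha(''));                    // e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 (Git's empty blob)
```

The character count and the UTF-8 byte count only agree for pure ASCII text, which may be why the SHA matches when there are no accents. The hash would also differ if the stored file has trailing newlines that the local string lacks.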
Here’s a minimal example to reproduce the issue:
import { Octokit } from "@octokit/rest";
import CryptoJS from 'crypto-js';

const octokit = new Octokit({ auth: 'personal-access-token' });

// Local file content with accented characters
const localFileContentString = 'Quedar con la tía María';
const localFile = {
  path: 'Recordar.md',
  sha: getSha(localFileContentString),
  content: localFileContentString,
  base64Content: btoa(localFileContentString)
};

// Function to calculate SHA1 of file content
function getSha(fileContents) {
  const size = fileContents.length;
  const blobString = `blob ${size} ${fileContents}`;
  return CryptoJS.SHA1(blobString).toString(CryptoJS.enc.Hex);
}

// Fetch the file content from GitHub repo
async function getRepoFile() {
  const existingFileResponse = await octokit.repos.getContent({
    owner: 'github-username',
    repo: 'vault-name',
    path: localFile.path
  });
  // The API returns Base64 with embedded line breaks; strip them
  return {
    sha: existingFileResponse.data.sha,
    base64content: existingFileResponse.data.content.replace(/\n/g, '')
  };
}

const repoFile = await getRepoFile();
console.log('EncodedRepoFile', repoFile, "\n");
console.log('EncodedLocalFile', localFile, "\n");

// Decode the base64 content from both the repo and local file
console.log('DecodedRepoFile', Buffer.from(repoFile.base64content, 'base64').toString());
console.log('DecodedLocalFile', Buffer.from(localFile.base64Content, 'base64').toString());
And the code output:
EncodedRepoFile {
  sha: '9fe35536cd6188e428ee04dcb559d69ecfb4d5d9',
  base64content: 'UXVlZGFyIGNvbiBsYSB0w61hIE1hcsOtYQoK'
}
EncodedLocalFile {
  path: 'Recordar.md',
  sha: '9860966172762a56f5b3dec12d51d4b1fb1034e8',
  content: 'Quedar con la tía María',
  base64Content: 'UXVlZGFyIGNvbiBsYSB07WEgTWFy7WE='
}
DecodedRepoFile Quedar con la tía María
DecodedLocalFile Quedar con la t�a Mar�a
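The two Base64 strings in the output differ exactly where the "í" appears: the repo version contains the two UTF-8 bytes 0xC3 0xAD, while the local version contains the single byte 0xED. A small sketch that compares the two encodings side by side, under the assumption that `btoa` treats each character as one Latin-1 byte while `Buffer.from(text, 'utf8')` produces UTF-8 bytes. The variable names are hypothetical and only for illustration.

```javascript
// Sketch: Base64-encode the same string two different ways and compare.
const text = 'Quedar con la tía María';

// btoa() maps each character to a single Latin-1 byte, so "í" becomes 0xED.
const latin1Base64 = btoa(text);

// Buffer.from(text, 'utf8') encodes "í" as the two bytes 0xC3 0xAD.
const utf8Base64 = Buffer.from(text, 'utf8').toString('base64');

console.log(latin1Base64); // UXVlZGFyIGNvbiBsYSB07WEgTWFy7WE=
console.log(utf8Base64);   // UXVlZGFyIGNvbiBsYSB0w61hIE1hcsOtYQ==

// A lone 0xED byte is not valid UTF-8, so it decodes to U+FFFD.
const decodedLatin1 = Buffer.from(latin1Base64, 'base64').toString('utf8');
// The UTF-8 bytes round-trip back to the original string.
const decodedUtf8 = Buffer.from(utf8Base64, 'base64').toString('utf8');

console.log(decodedLatin1);        // Quedar con la t�a Mar�a
console.log(decodedUtf8 === text); // true
```

If this assumption holds, the corrupted DecodedLocalFile line would come from the local `btoa` call rather than from anything GitHub does to the file. The repo string is also slightly longer because the stored file ends with two newline bytes (the trailing `QoK`).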
I think the problem lies in how GitHub itself handles these special characters, and I don’t know how to work around it. I’m using UTF-8, and I’ve tried changing the encoding to ISO-8859-1, computing the SHA of the corrupted string to at least check whether it matches, and verifying with libraries like iconv-lite and chardet that the encoding is consistent throughout the code, but none of that works.