Dear TinyLlama Community,

First and foremost, we would like to extend our sincere apologies for the oversight in our project's data handling.

🔍 What Happened?

A bug (https://github.com/jzhang38/TinyLlama/issues/67) related to data shuffling in the dataloader was found. As a result, approximately 35% of the data was not being loaded correctly, and some data may have been processed multiple times within a single epoch.
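The exact root cause is documented in the linked issue; as a hypothetical illustration only, the sketch below shows how one common class of shuffling bug (drawing epoch indices with replacement instead of permuting them) silently skips a large fraction of the dataset while duplicating other samples:

```python
import random

def buggy_epoch_indices(n):
    # Bug pattern (hypothetical): sampling indices WITH replacement.
    # Each draw is independent, so some samples repeat within the epoch
    # and others are never visited at all.
    return [random.randrange(n) for _ in range(n)]

def correct_epoch_indices(n):
    # Correct behavior: a true shuffle is a permutation of the dataset,
    # so every sample is visited exactly once per epoch.
    indices = list(range(n))
    random.shuffle(indices)
    return indices

n = 100_000
seen = set(buggy_epoch_indices(n))
missed = 1 - len(seen) / n
# With replacement, roughly 1/e (about 37%) of samples are never seen.
print(f"fraction of dataset never seen in one epoch: {missed:.1%}")

# A real shuffle covers the whole dataset exactly once.
assert sorted(correct_epoch_indices(n)) == list(range(n))
```

This is not necessarily the mechanism behind the TinyLlama issue, only a demonstration of why a subtle indexing error in a dataloader can both drop and repeat data at the scale reported.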

🛠 What Are We Doing About It?

We have already taken steps to rectify the issue. The bug has been fixed in the codebase, and our plan is to manually terminate the current training run, deploy the corrected code, and resume training from the 1T-token checkpoint.

How Does This Affect the Project Timeline?

Due to these necessary adjustments, our release schedule will be delayed by approximately half a month. We know that many of you have been eagerly anticipating our upcoming checkpoints and releases, and we deeply regret any inconvenience this causes.

📄 Further Information:

We are committed to transparency, and we will be discussing this issue extensively in our upcoming technical report. We hope that this will shed light on the intricacies of the problem and our solutions.

🙏 A Note of Gratitude:

A huge thank you to the community members who brought this to our attention. Your active participation ensures the robustness and quality of the TinyLlama project. We value every piece of feedback, comment, and suggestion from all of you.

Best,