🔄 Recent Progress:
Here are the latest updates on the TinyLlama project from the past few days. Thanks to our community, we have made significant progress, as evidenced by the noticeable reduction in both training and validation losses.
🔄 Training Environment Transition:
Because the original training servers became unavailable, we have migrated TinyLlama's training to a new setup, which caused a delay of several days. The transition also required completely rebuilding the pretraining data from scratch. A key adjustment in the preprocessing phase was removing a line of code that had been inserting excessive end-of-sequence (EOS) tokens. Since the same pipeline processed all of our data, this EOS token issue affected both the training and validation data.
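To illustrate the kind of bug involved, here is a minimal sketch of a document-level tokenization loop. All names here (ToyTokenizer, tokenize_documents) are hypothetical stand-ins, not the actual TinyLlama pipeline; the point is only that a stray duplicate append can inflate the EOS count in the packed token stream.

```python
# Minimal illustrative sketch, not the actual TinyLlama preprocessing code.
# ToyTokenizer is a hypothetical stand-in for a SentencePiece-style tokenizer.

class ToyTokenizer:
    """Stand-in tokenizer: maps each character to its code point."""
    eos_id = 0

    def encode(self, text):
        return [ord(c) for c in text]

def tokenize_documents(documents, tokenizer):
    """Pack a list of documents into one flat token stream."""
    ids = []
    for doc in documents:
        ids.extend(tokenizer.encode(doc))
        ids.append(tokenizer.eos_id)   # intended: exactly one EOS per document
        # ids.append(tokenizer.eos_id) # the removed line behaved roughly like
                                       # this, appending an extra EOS every time
    return ids

stream = tokenize_documents(["hello", "world"], ToyTokenizer())
print(stream)  # one EOS (id 0) after each document
```

With a duplicate append like the commented-out line in place, every document would be followed by multiple EOS tokens, skewing the token distribution the model sees during training.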
For consistency, however, we have decided to continue using the original validation set. In the near future, we will craft a new validation set without the EOS token issue, allowing for more accurate evaluations at each checkpoint.
🙏 Appreciation for Our Community:
We are grateful to the community for continuously offering support and feedback on this work. TinyLlama is truly a community effort, and our decision to conduct it as open, live research has proven to be the right one. Your feedback will be invaluable in further improving TinyLlama's robustness and quality. Thank you again for your ongoing support of the TinyLlama project!