The Open Data Institute (ODI) will help develop an open metadata standard for machine learning data

Tuesday, 27/08/2024 05:29

The ODI to help develop an open metadata standard for machine learning data

Wed Mar 6, 2024

Source: https://theodi.org/news-and-events/blog/the-odi-to-help-develop-an-open-metadata-standard-for-machine-learning-data/

Published online: 06/03/2024

MLCommons has announced the release of Croissant, a metadata format to help standardise the documentation of machine learning (ML) datasets. Croissant is set to make a huge difference to data practices in AI - as AI practitioners adopt it to describe their datasets and more AI platforms support Croissant-annotated datasets. This promises to be a game changer in AI safety and ethics, where high-quality, well-documented datasets are essential.

Currently, many ML datasets lack sufficient machine-readable documentation to allow people to use them responsibly. Without this information, finding, understanding, and using these datasets safely and ethically can be very time-consuming.

Croissant aims to make data more easily accessible and discoverable. It enables datasets to be loaded into different AI platforms without the need for reformatting. Users looking to publish a dataset in the Croissant format benefit from the ‘Croissant editor’, which allows them to easily inspect, create, or modify Croissant descriptions for their datasets. There is also the MLCroissant Python Library for programmatic support.
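To give a rough sense of what machine-readable dataset documentation can look like, here is a minimal Python sketch that builds a JSON-LD style description and checks it for missing descriptive fields. The key names are loosely modelled on the Croissant vocabulary (schema.org terms plus MLCommons additions), but the dataset, URLs, the `build_metadata` and `missing_documentation` helpers, and the choice of "required" keys are all illustrative assumptions. It uses only the standard library rather than the MLCroissant library, and it does not claim to produce a valid Croissant file.

```python
import json

# Keys this sketch treats as the minimum documentation a reader needs.
# This list is an assumption for illustration, not the Croissant specification.
REQUIRED_KEYS = ("name", "description", "license", "distribution")


def build_metadata(name, description, license_url, file_url):
    """Build a simplified, Croissant-like description of a one-file dataset."""
    return {
        "@context": {
            "sc": "https://schema.org/",
            "cr": "http://mlcommons.org/croissant/",
        },
        "@type": "sc:Dataset",
        "name": name,
        "description": description,
        "license": license_url,
        # Where the raw data lives and how it is encoded.
        "distribution": [
            {
                "@type": "cr:FileObject",
                "@id": "data.csv",
                "contentUrl": file_url,
                "encodingFormat": "text/csv",
            }
        ],
    }


def missing_documentation(metadata):
    """Return the descriptive keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not metadata.get(key)]


# Hypothetical example dataset; the name and URLs are placeholders.
metadata = build_metadata(
    name="example-reviews",
    description="Short product reviews with sentiment labels.",
    license_url="https://creativecommons.org/licenses/by/4.0/",
    file_url="https://example.org/data.csv",
)

print(json.dumps(metadata, indent=2))
print("Missing fields:", missing_documentation(metadata))
```

Because the description is plain JSON, a platform could in principle read the same file to find the data, its format, and its licence without any reformatting, which is the kind of interoperability the paragraph above describes.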

The ODI has been an early supporter of the initiative, with our Director of Research Prof Elena Simperl co-chairing the Croissant working group. Moving forward, the ODI will help to advance Croissant in several ways, including piloting and evaluating the standard on key ML datasets, and promoting Croissant to the wider AI/ML community, in particular in the UK and Europe.

The ODI has an extensive track record of designing, evaluating, and promoting open data standards in multiple domains, including the UK Open Banking standard, the OpenActive standard, and the Deutsche Gesellschaft für Internationale Zusammenarbeit (GIZ) Data4Policy. Open standards and interoperable data infrastructure are at the core of the 15-point plan for our data-centric AI programme. Together with our work on data infrastructure, data stewardship and governance, we look forward to building a global community and fostering the adoption of Croissant.

“Data is a critical element of any model's performance, and as some experts suggest, it will run out, making the need to harness it even more important. Croissant allows more people to do more with data. As co-chair of the working group, it is a privilege to collaborate with world-class machine learning scientists and engineers around the globe, making an enormous contribution to the AI data ecosystem.”

Prof Elena Simperl

Director of Research at the ODI, Professor of Computer Science at King’s College London and co-chair of the Croissant working group

Croissant is made possible thanks to efforts by the MLCommons Croissant working group, which includes contributors from these organisations: Bayer, cTuning Foundation, DANS-KNAW, Dotphoton, Google, Harvard, Hugging Face, Kaggle, King's College London, the ODI, Meta, NASA, Open University of Catalonia - Luxembourg Institute of Science and Technology, and TU Eindhoven.

You can join the Croissant Working Group, contribute to the GitHub repository, and download the Croissant Editor to implement the Croissant vocabulary on your existing datasets.

Translated by: Lê Trung Nghĩa

letrungnghia.foss@gmail.com

Author: Nghĩa Lê Trung