AI data transparency: understanding the needs and current state of play

Monday - 26/08/2024 06:05

Mon Jun 24, 2024

Source: https://theodi.org/news-and-events/blog/ai-data-transparency-understanding-the-needs-and-current-state-of-play/#main

Posted online: 24/06/2024

There is very little transparency about the data used in AI systems - a fact that is causing growing concern as these systems are increasingly deployed with real-world consequences.

As AI systems become increasingly used in everyday work and life, understanding key aspects of how these systems have been created and how far to trust the outcomes is becoming more and more essential.

As we outline in an article we recently published in the Harvard Business Review, enormous, unwieldy and opaque data sources are used as the basis for producing the outputs of generative AI systems. The failure to publicly document the contents and usage of datasets hampers the ability of developers, researchers, ethicists, and lawmakers to address various issues such as biases, harmful content, copyright concerns, and risks to personal or sensitive data. This lack of documentation spans all data elements, including training and fine-tuning datasets, as well as the sourcing and labelling processes.

The demand for AI transparency has become increasingly recognised in recent years. As a result, parts of the AI community have made significant contributions to AI data transparency, including the emergence and growing uptake of standardised transparency guidelines. For example, Hugging Face, a vast repository of AI models and datasets, promotes the use of Model Cards and Dataset Cards among its community of developers. In another example, the Croissant initiative, supported by major platforms like TensorFlow and Hugging Face, provides machine-readable metadata (information about the datasets) for machine learning (ML) datasets. This improves their accessibility, discoverability, and reproducibility, and helps AI practitioners manage and account for their work with the datasets. These resources guide developers on documenting how a model or dataset was created, what it contains, and which potential legal or ethical issues to consider when working with it. Lawmakers are also responding to increasing demands by proposing legislation that specifically addresses AI data transparency, a topic we discuss further in our first policy position (of five we will publish in total) on the strong data infrastructure needed to realise responsible AI.
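
To make the idea of machine-readable dataset documentation more concrete, here is a minimal Python sketch. It is an illustration only: the `DatasetRecord` class and all of its field names are our own assumptions, loosely based on the questions listed above (how a dataset was created, what it contains, and which legal or ethical issues apply). It does not follow the Croissant schema or the Hugging Face Dataset Card format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class DatasetRecord:
    """Hypothetical documentation record; field names are illustrative assumptions."""

    name: str
    description: str                 # what the dataset contains
    sources: List[str]               # where the data came from
    collection_process: str          # how the dataset was created
    licence: str                     # licence under which the data is shared
    contains_personal_data: bool     # flag for personal or sensitive data
    known_issues: List[str] = field(default_factory=list)  # legal or ethical caveats


def to_machine_readable(record: DatasetRecord) -> str:
    """Serialise the record to JSON so that tools, not just people, can read it."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)


if __name__ == "__main__":
    # Placeholder values for illustration; this is not a real dataset.
    example = DatasetRecord(
        name="example-reviews",
        description="Short product reviews with sentiment labels.",
        sources=["hypothetical public review site"],
        collection_process="Scraped, deduplicated, then labelled by annotators.",
        licence="CC-BY-4.0",
        contains_personal_data=False,
    )
    print(to_machine_readable(example))
```

Keeping the record as structured data rather than free text is what lets repositories index and compare it, although real standards define far richer vocabularies than this sketch.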

Despite the need for greater transparency in AI data practices, systematic methods for monitoring it are still lacking across many systems. An October 2023 study of 10 key generative AI (‘foundation’) models by Stanford researchers highlighted that, amid generally low transparency across AI system development, transparency about data is particularly poor. A recently released update to the study included several more models and noted slight improvement by some developers, but overall data transparency remains poor.

In a forthcoming study by members of our data-centric AI research team, we replicated the analysis on a wider range of 54 AI systems that are causing public concern, having been at the centre of AI incidents recorded in the Partnership on AI's AI Incident Database. We found that only a minority of these AI systems provided identifiable information about their underlying models and data practices. Transparency scores (evaluated for those systems offering basic model transparency information) were low across all indicators including data size, data sources and curation, with each indicator present in less than 40% of the models evaluated. Almost none of the systems scored included information about whether copyrighted data or personal information was included in the data, or about the use of data licences.
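
As a rough illustration of what "each indicator present in less than 40% of the models" means in practice, the sketch below computes the share of systems that disclose each indicator. The indicator names are taken from the paragraph above. The data layout, the `indicator_coverage` function and the example systems are assumptions made for illustration, and the values are invented placeholders, not results from the study.

```python
from typing import Dict

# Indicator names follow the paragraph above; treating each one as a simple
# yes/no disclosure flag is an assumption of this sketch.
INDICATORS = [
    "data_size",
    "data_sources",
    "data_curation",
    "copyrighted_data",
    "personal_information",
    "data_licences",
]


def indicator_coverage(systems: Dict[str, Dict[str, bool]]) -> Dict[str, float]:
    """Return, for each indicator, the share of systems (0.0 to 1.0) that disclose it."""
    if not systems:
        return {name: 0.0 for name in INDICATORS}
    total = len(systems)
    return {
        name: sum(1 for flags in systems.values() if flags.get(name, False)) / total
        for name in INDICATORS
    }


# Invented placeholder systems; a missing key means "not disclosed".
EXAMPLE_SYSTEMS: Dict[str, Dict[str, bool]] = {
    "system_a": {"data_size": True, "data_sources": True},
    "system_b": {"data_sources": True, "data_licences": False},
    "system_c": {},
    "system_d": {"data_size": False, "data_curation": True},
}

if __name__ == "__main__":
    for name, share in indicator_coverage(EXAMPLE_SYSTEMS).items():
        print(f"{name}: {share:.0%}")
```

A per-indicator share like this is only one possible way to summarise transparency; an index could also weight indicators differently or score the quality of what is disclosed.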

To build on our findings, we are developing an AI data transparency index to provide a clearer picture of how data transparency varies across different types of system providers, based on a deeper understanding of the needs for such information. Investigating the need for data transparency within the ecosystem will build on current evidence, including recent Open Futures research on transparency documentation. Further research will focus on empowering non-specialists and communities with transparency information, and on understanding the barriers and opportunities for AI practitioners to communicate data transparency effectively.

While transparency cannot be considered a ‘silver bullet’ for addressing the ethical challenges associated with AI systems, or building trust, it is a prerequisite for informed decision-making and other forms of intervention like regulation. If you are interested in collaborating with us on our ongoing research and advocacy in this area or would like to discuss this work further, please get in touch.

Translated by: Lê Trung Nghĩa

letrungnghia.foss@gmail.com

Author: Nghĩa Lê Trung