Loading Huggingface Dataset

Asked 2 years ago by cookie1986 · #37793

I am attempting to load the 'wiki40b' dataset, following the instructions in the Huggingface documentation. Because the full dataset is potentially so large, I am attempting to load only a small subset of the data. Below, I try to load the Danish language subset:

from datasets import load_dataset
dataset = load_dataset('wiki40b', 'da')

When I run this, I get the following:

MissingBeamOptions: Trying to generate a dataset using Apache Beam, yet no Beam Runner or PipelineOptions() has been provided in load_dataset or in the builder arguments. For big datasets it has to run on large-scale data processing tools like Dataflow, Spark, etc. More information about Apache Beam runners at https://beam.apache.org/documentation/runners/capability-matrix/ If you really want to run it locally because you feel like the Dataset is small enough, you can use the local beam runner called DirectRunner (you may run out of memory). Example of usage: load_dataset('wiki40b', 'da', beam_runner='DirectRunner')

Given that the Danish dataset is small, I was hoping to load the data locally, so I re-ran the script with the DirectRunner:

dataset = load_dataset('wiki40b', 'da', beam_runner='DirectRunner')

This, however, results in the following:

AttributeError: 'NoneType' object has no attribute 'projectNumber'

I'm fairly inexperienced with this, and I'm not sure where to turn next.

Tags: python, apache-beam, huggingface-datasets

0 Answers
