soury committed
Commit 0bc4257 · 1 Parent(s): 0ae53cb

2nd model rf

Files changed (4)
  1. .gitignore +3 -2
  2. README.md +36 -61
  3. models/audio_classification_rf.pkl +3 -0
  4. tasks/audio.py +42 -21
.gitignore CHANGED
@@ -6,7 +6,9 @@ __pycache__/
  .env
  .ipynb_checkpoints
  .vscode/
- 
+ notebooks
+ Pipfile
+ Pipfile.lock
  eval-queue/
  eval-results/
  eval-queue-bk/
@@ -14,4 +16,3 @@ eval-results-bk/
  logs/
  
  emissions.csv
- notebooks/test.ipynb
README.md CHANGED
@@ -7,65 +7,40 @@ sdk: docker
  pinned: false
  ---
  
- # Random Baseline Model for Climate Disinformation Classification
- 
- ## Model Description
- 
- This is a random baseline model for the Frugal AI Challenge 2024, specifically for the text classification task of identifying climate disinformation. The model serves as a performance floor, randomly assigning labels to text inputs without any learning.
- 
- ### Intended Use
- 
- - **Primary intended uses**: Baseline comparison for climate disinformation classification models
- - **Primary intended users**: Researchers and developers participating in the Frugal AI Challenge
- - **Out-of-scope use cases**: Not intended for production use or real-world classification tasks
- 
- ## Training Data
- 
- The model uses the QuotaClimat/frugalaichallenge-text-train dataset:
- - Size: ~6000 examples
- - Split: 80% train, 20% test
- - 8 categories of climate disinformation claims
- 
- ### Labels
- 0. No relevant claim detected
- 1. Global warming is not happening
- 2. Not caused by humans
- 3. Not bad or beneficial
- 4. Solutions harmful/unnecessary
- 5. Science is unreliable
- 6. Proponents are biased
- 7. Fossil fuels are needed
- 
- ## Performance
- 
- ### Metrics
- - **Accuracy**: ~12.5% (random chance with 8 classes)
- - **Environmental Impact**:
-   - Emissions tracked in gCO2eq
-   - Energy consumption tracked in Wh
- 
- ### Model Architecture
- The model implements a random choice between the 8 possible labels, serving as the simplest possible baseline.
- 
- ## Environmental Impact
- 
- Environmental impact is tracked using CodeCarbon, measuring:
- - Carbon emissions during inference
- - Energy consumption during inference
- 
- This tracking helps establish a baseline for the environmental impact of model deployment and inference.
- 
- ## Limitations
- - Makes completely random predictions
- - No learning or pattern recognition
- - No consideration of input text
- - Serves only as a baseline reference
- - Not suitable for any real-world applications
- 
- ## Ethical Considerations
- 
- - Dataset contains sensitive topics related to climate disinformation
- - Model makes random predictions and should not be used for actual classification
- - Environmental impact is tracked to promote awareness of AI's carbon footprint
- ```
+ # Audio Classification Model for Illegal Deforestation Detection
+ 
+ ## General Information
+ 
+ The aim of this model is to detect illegal deforestation from audio clips. Our objective is to make this AI system as frugal as possible.
+ 
+ When you are new to AI for audio processing and look for information on the Internet, the following methodology is often described: transform the audio signal into a spectrogram (a 2D image) and then have it analyzed by a CNN. This can be necessary for very precise tasks such as transcription, but it is too heavy for our task, which is simply to detect chainsaw noises. So, for our baseline, we preprocessed the data with an MFCC transform, reduced the 2D output to 1D by taking the mean of each feature, and finally applied a basic ML classification algorithm such as Random Forest (a minimal sketch follows).
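+ 
+ A minimal sketch of that baseline, assuming `train_audios` and `train_labels` (hypothetical names) hold the raw waveforms and labels of the challenge training split at the dataset's native 12 kHz:
+ 
+ ```python
+ import librosa
+ import numpy as np
+ from sklearn.ensemble import RandomForestClassifier
+ 
+ def extract_features(audio, sr=12000, n_mfcc=10):
+     # 2D MFCC matrix (n_mfcc x frames), averaged over time into one 1D vector
+     mfccs = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)
+     return np.mean(mfccs.T, axis=0)
+ 
+ X_train = np.array([extract_features(a) for a in train_audios])
+ clf = RandomForestClassifier(random_state=0).fit(X_train, train_labels)
+ ```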
+ 
+ Then, we tried to optimize the two stages separately: the data preprocessing (to make it simple and quick, and to produce the smallest possible preprocessed dataset) and the ML model itself. At this point, we noticed that preprocessing the data (with the MFCC transformation) was consuming 12 times more energy than training and running our model, so we mostly worked on optimizing the data preprocessing. Here are our main ideas:
+ 1. The data preprocessing:
+    * Compare different methods of audio feature extraction
+    * Decrease the size of the data to analyze (e.g. lower the sampling rate; keep only a small sample of the initial 3 s audio, since we don't really need several seconds to identify a sound; remove unnecessary data, as plots show that the characteristic sounds of chainsaws are observed between 200 Hz and 1000 Hz)
+    * Decrease the number of features extracted
+ 
+ 2. The ML model:
+    * Use the most lightweight model: avoid neural networks and compare basic ML classification algorithms; for example, k-NN is 3 times less energy-consuming than Random Forest with a non-significant loss of accuracy (see the measurement sketch after this list)
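+ 
+ A sketch of how such a comparison can be measured, assuming `X_train` and `y_train` (hypothetical names) hold the preprocessed features and labels, and reusing CodeCarbon's task API as in `tasks/audio.py`; the hyperparameters are illustrative defaults, not necessarily the submitted ones:
+ 
+ ```python
+ from codecarbon import EmissionsTracker
+ from sklearn.ensemble import RandomForestClassifier
+ from sklearn.neighbors import KNeighborsClassifier
+ 
+ tracker = EmissionsTracker()
+ tracker.start()
+ for name, model in [("knn", KNeighborsClassifier(n_neighbors=5)),
+                     ("rf", RandomForestClassifier(n_estimators=100))]:
+     tracker.start_task(f"train_{name}")
+     model.fit(X_train, y_train)
+     data = tracker.stop_task()
+     print(name, f"{data.energy_consumed:.6f} kWh")  # energy consumed while fitting
+ tracker.stop()
+ ```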
+ 
+ ## Submitted models
+ 
+ Model 1:
+ * Data preprocessing: resampling to 6000 Hz and using the librosa MFCC method to compute 7 MFCCs
+ * Energy consumption for processing the training dataset: 0.005349 kWh
+ * Model: k-NN
+ * Energy consumption for training the model: 0.000002 kWh
+ 
+ Model 2:
+ * Data preprocessing: resampling to 6000 Hz and using the librosa MFCC method to compute 10 MFCCs
+ * Energy consumption for processing the training dataset: 0.005648 kWh
+ * Model: Random Forest
+ * Energy consumption for training the model: 0.000326 kWh
+ 
+ The 2nd model has better accuracy but is less energy-efficient.
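+ 
+ The training script itself is not part of this commit; below is a sketch of what presumably produced the pickle loaded by `tasks/audio.py`, assuming `train_dataset` (hypothetical name) is the training split loaded as in that file and that the Random Forest used default hyperparameters:
+ 
+ ```python
+ import joblib
+ import librosa
+ import numpy as np
+ from sklearn.ensemble import RandomForestClassifier
+ 
+ TARGET_SR = 6000  # resampling rate used for both submitted models
+ 
+ def preprocess(dataset):
+     features = []
+     for row in dataset:
+         # Resample from the dataset's native 12 kHz, then average 10 MFCCs over time
+         audio = librosa.resample(row["audio"]["array"], orig_sr=12000, target_sr=TARGET_SR)
+         mfccs = librosa.feature.mfcc(y=audio, sr=TARGET_SR, n_mfcc=10)
+         features.append(np.mean(mfccs.T, axis=0))
+     return np.array(features)
+ 
+ X_train = preprocess(train_dataset)
+ y_train = train_dataset["label"]
+ model = RandomForestClassifier(random_state=0).fit(X_train, y_train)  # hyperparameters assumed
+ joblib.dump(model, "models/audio_classification_rf.pkl")
+ ```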
+ 
+ ## Other avenues for optimization
+ 
+ We learned a lot about the possible optimizations, but we did not keep all of them for the final submission, as they led to an excessive loss of accuracy (down to ~85%). They are described below, with a sketch after the list:
+ * Use a less complex method of audio feature extraction (for example, the spectral centroid is 5 times faster to compute than MFCCs)
+ * Focus the analysis on the frequencies that really matter for chainsaws (from 0 to 20,000 Hz)
+ * Use less data (shorter audio), i.e. randomly keep only 0.2 s of each audio clip in the dataset before extracting MFCC features, and likewise take only a small sample of each test clip when predicting its label
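+ 
+ A sketch of the last two ideas combined, under the same assumptions as above (12 kHz source audio; the 0.2 s window and the spectral-centroid feature come straight from the list):
+ 
+ ```python
+ import librosa
+ import numpy as np
+ 
+ def crop_and_centroid(audio, sr=12000, crop_s=0.2, seed=0):
+     # Keep only a random 0.2 s window of the clip instead of the full 3 s
+     rng = np.random.default_rng(seed)
+     n = int(crop_s * sr)
+     start = int(rng.integers(0, max(1, len(audio) - n)))
+     crop = audio[start:start + n]
+     # Spectral centroid: one value per frame, averaged into a single scalar feature
+     centroid = librosa.feature.spectral_centroid(y=crop, sr=sr)
+     return float(np.mean(centroid))
+ ```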
models/audio_classification_rf.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d180471cec554f4b107a578e7badc3972e70a5702efb6258a0f4ba6172808a9b
+ size 6603238
tasks/audio.py CHANGED
@@ -2,8 +2,10 @@ from fastapi import APIRouter
  from datetime import datetime
  from datasets import load_dataset
  from sklearn.metrics import accuracy_score
- import random
  import os
+ import joblib
+ import librosa
+ import numpy as np
  
  from .utils.evaluation import AudioEvaluationRequest
  from .utils.emissions import tracker, clean_emissions_data, get_space_info
@@ -13,17 +15,16 @@ load_dotenv()
  
  router = APIRouter()
  
- DESCRIPTION = "Random Baseline"
+ DESCRIPTION = "Model 2: Random Forest audio classification"
  ROUTE = "/audio"
  
  
- 
  @router.post(ROUTE, tags=["Audio Task"],
               description=DESCRIPTION)
  async def evaluate_audio(request: AudioEvaluationRequest):
      """
      Evaluate audio classification for rainforest sound detection.
- 
+ 
      Current Model: Random Baseline
      - Makes random predictions from the label space (0-1)
      - Used as a baseline for comparison
@@ -38,35 +39,56 @@ async def evaluate_audio(request: AudioEvaluationRequest):
      }
      # Load and prepare the dataset
      # Because the dataset is gated, we need to use the HF_TOKEN environment variable to authenticate
-     dataset = load_dataset(request.dataset_name,token=os.getenv("HF_TOKEN"))
- 
+     dataset = load_dataset(request.dataset_name, token=os.getenv("HF_TOKEN"))
+ 
      # Split dataset
      train_test = dataset["train"]
      test_dataset = dataset["test"]
- 
+ 
      # Start tracking emissions
      tracker.start()
      tracker.start_task("inference")
- 
-     #--------------------------------------------------------------------------------------------
+ 
+     # --------------------------------------------------------------------------------------------
      # YOUR MODEL INFERENCE CODE HERE
      # Update the code below to replace the random baseline by your model inference within the inference pass where the energy consumption and emissions are tracked.
-     #--------------------------------------------------------------------------------------------
- 
-     # Make random predictions (placeholder for actual model inference)
+     # --------------------------------------------------------------------------------------------
+     # Data formatting
+ 
+     def preprocess(dataset):
+         features = []
+         for row in dataset:
+             # Load the audio array and resample it from the dataset's native 12 kHz
+             target_sr = 6000
+             audio = row["audio"]["array"]
+             audio = librosa.resample(audio, orig_sr=12000, target_sr=target_sr)
+ 
+             # Extract the MFCC features and average them over time into a 1D vector
+             mfccs = librosa.feature.mfcc(y=audio, sr=target_sr, n_mfcc=10)
+             mfccs_scaled = np.mean(mfccs.T, axis=0)
+ 
+             # Append the aggregated feature vector
+             features.append(mfccs_scaled)
+ 
+         return np.array(features)
+ 
+     X_test = preprocess(test_dataset)
+ 
+     classification_model = joblib.load("./models/audio_classification_rf.pkl")
+ 
+     predictions = classification_model.predict(X_test)
      true_labels = test_dataset["label"]
-     predictions = [random.randint(0, 1) for _ in range(len(true_labels))]
- 
-     #--------------------------------------------------------------------------------------------
+ 
+     # --------------------------------------------------------------------------------------------
      # YOUR MODEL INFERENCE STOPS HERE
-     #--------------------------------------------------------------------------------------------
- 
+     # --------------------------------------------------------------------------------------------
+ 
      # Stop tracking emissions
      emissions_data = tracker.stop_task()
+ 
      # Calculate accuracy
      accuracy = accuracy_score(true_labels, predictions)
+ 
      # Prepare results dictionary
      results = {
          "username": username,
@@ -84,5 +106,4 @@ async def evaluate_audio(request: AudioEvaluationRequest):
              "test_seed": request.test_seed
          }
      }
- 
-     return results
+     return results
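
For reference, a hypothetical client call against the updated endpoint, assuming the Space runs locally on port 7860 and that `AudioEvaluationRequest` accepts at least `dataset_name` and `test_seed` (its full schema lives in `.utils.evaluation`, which this commit does not touch):

```python
import requests

payload = {
    "dataset_name": "<gated-audio-dataset>",  # placeholder: the real dataset name is gated
    "test_seed": 42,
}
response = requests.post("http://localhost:7860/audio", json=payload)
print(response.json())  # accuracy, energy/emissions data, and submission metadata
```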