Comments (14)
Thanks
A few questions:
- Would you be willing to share logs from the time window in which you saw duplicates?
- Where did you see the duplicates? In the Kafka landing table on the Snowflake side?
from snowflake-kafka-connector.
What exactly would you like to see? I can grep the logs and send you the exact lines.
from snowflake-kafka-connector.
The duplicates are in the Snowflake tables that the connector loads the Kafka data into, @sfc-gh-japatel.
from snowflake-kafka-connector.
Which version of SF KC are you using?
As for logs, we want to see whether there was any exception or unusual scenario, to help us better understand where and how the duplicates occur.
from snowflake-kafka-connector.
The entire log would be huge. If you can tell me exactly what needs to be seen in the logs, I can grep for it and send exactly what you need.
Version: 2.0.0
from snowflake-kafka-connector.
A few more questions/asks:
- Can you upgrade to the latest version? https://github.com/snowflakedb/snowflake-kafka-connector/releases/tag/v2.0.1
- Can you reproduce this issue every time?
- Please share logs from around the time the duplicates appeared in your Snowflake table. I want to see which exception, if any, is causing this.
- I also see 38 topics in your single connector. Are there any conflicts in the topic-to-table mapping (topic2tablemap)? That is, do all of them go to the same table or to different tables?
from snowflake-kafka-connector.
One thing I want to mention:
I reload data often, and to do that I recreate the sink connector under a new name so that it loads from the beginning of the topic.
I don't change the base table name in Snowflake, so the data always lands in the same table, from the same Kafka topic, but through a sink connector with a different name.
from snowflake-kafka-connector.
I see. That might well be one of the reasons you see duplicates. Can you always choose a new table in Snowflake?
Also, we had a bug in 2.0.0, fixed in 2.0.1, that could cause missing data if you have a lot of rebalances.
We haven't seen a duplicate issue, but what you describe could be the reason.
from snowflake-kafka-connector.
Again, the way I have set this up, there are streams on top of the tables.
The issue is what I'm seeing on the Kafka broker/Event Hub: the data is going out twice.
from snowflake-kafka-connector.
I hope it has nothing to do with offset resetting.
Also, if there is a network error, shouldn't the connector check the offset value before reloading data? @sfc-gh-japatel
from snowflake-kafka-connector.
We do that today! Please verify on the Snowflake side that an offset was ingested twice into your landing table. The metadata column can be used for that, and you can run a few queries to identify whether you have duplicates.
But if you recreate the connector under a new name, you are bound to get this issue, since a new connector name means a new consumer group, and it will start from offset 0.
I would recommend:
- Upgrading to the newer version.
- Seeing whether you can reproduce this again.
- Sending logs if you can.
- Confirming how you check for duplicate data on the Snowflake side, with the SQL query if possible.
Thanks
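The consumer-group explanation above can be illustrated with a toy model. This is a minimal sketch, not connector code: `ToyBroker`, `run_sink`, and the in-memory `table` are hypothetical names invented for illustration. It assumes Kafka Connect's default behavior, in which a sink connector's consumer group is derived from the connector name (`connect-<name>`) and a group with no committed offsets starts from the earliest record.

```python
from collections import defaultdict


def consumer_group_for(connector_name: str) -> str:
    # Assumption: Kafka Connect's default naming for sink consumer groups.
    return f"connect-{connector_name}"


class ToyBroker:
    """Hypothetical single-partition topic with per-group committed offsets."""

    def __init__(self, records):
        self.records = list(records)
        # group id -> next offset to read; unknown groups start at 0
        self.committed = defaultdict(int)

    def run_sink(self, connector_name, table):
        """Append every record not yet consumed by this connector's group."""
        group = consumer_group_for(connector_name)
        start = self.committed[group]
        for offset in range(start, len(self.records)):
            table.append((offset, self.records[offset]))
        self.committed[group] = len(self.records)


broker = ToyBroker(["a", "b", "c"])
table = []                          # stands in for the Snowflake landing table

broker.run_sink("sink_v1", table)   # first load: offsets 0..2
broker.run_sink("sink_v1", table)   # same name: nothing new to read
broker.run_sink("sink_v2", table)   # new name: new group, re-reads 0..2

offsets_in_table = [offset for offset, _ in table]
print(offsets_in_table)
```

In this model, restarting under the same name adds no rows, while recreating the connector under a new name writes every offset into the same table a second time.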
from snowflake-kafka-connector.
https://docs.snowflake.com/en/user-guide/kafka-connector-overview#schema-of-tables-for-kafka-topics
Here is what I use:
select
    record_metadata:"topic"::string as TOPIC,
    record_metadata:"partition"::number as PARTITION_NO,
    count(distinct record_metadata:"offset"::number) as unique_offsets_per_partition,
    count(record_metadata:"offset"::number) as count_per_partitions,
    min(record_metadata:"offset"::number) as min_offset,
    max(record_metadata:"offset"::number) as max_offset,
    (max_offset-min_offset+1) as diff_min_max_offsets,
    -- A nonzero difference here means there are gaps in the offset range.
    -- Duplicates show up as count_per_partitions > unique_offsets_per_partition.
    case diff_min_max_offsets-unique_offsets_per_partition
        when 0 then 'Nothing Missing'
        else 'Missing Data/Duplicate Data'
    end as data_completeness
from <base_table>
group by TOPIC, PARTITION_NO
order by PARTITION_NO
;
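The same completeness check can be sketched outside SQL so that duplicated and missing offsets are reported separately. This is an illustrative sketch under stated assumptions: `offset_report` is a hypothetical helper, and `rows` stands for (topic, partition, offset) values exported from the RECORD_METADATA column; the sample data is made up. A gap in offsets is not always data loss, since compacted or transactional topics can legitimately skip offsets.

```python
from collections import Counter, defaultdict


def offset_report(rows):
    """Summarize duplicated and missing offsets per (topic, partition).

    rows: iterable of (topic, partition, offset) tuples, one per table row.
    """
    per_partition = defaultdict(Counter)
    for topic, partition, offset in rows:
        per_partition[(topic, partition)][offset] += 1

    report = {}
    for key, counts in per_partition.items():
        lowest, highest = min(counts), max(counts)
        report[key] = {
            # offsets that appear in more than one row
            "duplicated": sorted(o for o, n in counts.items() if n > 1),
            # offsets absent from the min..max range
            "missing": [o for o in range(lowest, highest + 1) if o not in counts],
        }
    return report


# Made-up sample: partition 0 has offset 1 twice and no offset 2.
rows = [("t", 0, 0), ("t", 0, 1), ("t", 0, 1), ("t", 0, 3), ("t", 1, 0)]
report = offset_report(rows)
print(report)
```

Splitting the two lists makes it clear whether a partition flagged by the query has gaps, repeated offsets, or both.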
from snowflake-kafka-connector.
Hi @imrohankataria, I'm closing this since there has been no activity. Feel free to reopen it with more details on how you were able to reproduce the duplicate data; we would definitely treat that as a priority. Thanks.
from snowflake-kafka-connector.